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Abstract 

In this era of large-scale data, distributed systems built on top of clusters of commodity hard¬ 
ware provide cheap and reliable storage and scalable processing of massive data. With cheap 
storage, instead of storing only currently-relevant data, it is common to store as much data 
as possible, hoping that its value can be extracted later. In this way, exabytes (10^® bytes) of 
data are being created on a daily basis. Extracting value from these data however, requires 
scalable implementations of advanced analytical algorithms beyond simple data processing, 
e.g., statistical regression methods, linear algebra, and optimization algorithms. Many tra¬ 
ditional methods are designed to minimize floating-point operations, which is the dominant 
cost of in-memory computation on a single machine. In parallel and distributed environments, 
however, load balancing and communication, including disk and network I/O, can easily dom¬ 
inate computation. These factors greatly increase the complexity of algorithm design and 
challenge traditional ways of thinking about the design of parallel and distributed algorithms. 

Here, we review recent work on developing and implementing randomized matrix algo¬ 
rithms in large-scale parallel and distributed environments. Randomized algorithms for ma¬ 
trix problems have received a great deal of attention in recent years, thus far typically either in 
theory or in machine learning applications or with implementations on a single machine. Our 
main focus is on the underlying theory and practical implementation of random projection 
and random sampling algorithms for very large very overdetermined (be., overconstrained) i\ 
and li regression problems. Randomization can be used in one of two related ways: either to 
construct sub-sampled problems that can be solved, exactly or approximately, with traditional 
numerical methods; or to construct preconditioned versions of the original full problem that 
are easier to solve with traditional iterative algorithms. Theoretical results demonstrate that 
in near input-sparsity time and with only a few passes through the data one can obtain very 
strong relative-error approximate solutions, with high probability. Empirical results highlight 
the importance of various trade-offs (e.;?., between the time to construct an embedding and 
the conditioning quality of the embedding, between the relative importance of computation 
versus communication, etc.) and demonstrate that and regression problems can be solved 
to low, medium, or high precision in existing distributed systems on up to terabyte-sized data. 
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1 Introduction 


Matrix algorithms lie at the heart of many applications, both historically in areas such as signal 
processing and scientific computing as well as more recently in areas such as machine learning 
and data analysis. Essentially, the reason is that matrices provide a convenient mathematical 
structure with which to model data arising in a broad range of applications: an mxn real-valued 
matrix A provides a natural structure for encoding information about m objects, each of which 
is described by n features. Alternatively, an n x n real-valued matrix A can be used to describe 
the correlations between all pairs of n data points, or the weighted edge-edge adjacency matrix 
structure of an n-node graph. In astronomy, for example, very small angular regions of the sky 
imaged at a range of electromagnetic frequency bands can be represented as a matrix—in that 
case, an object is a region and the features are the elements of the frequency bands. Similarly, in 
genetics, DNA SNP (Single Nucleotide Polymorphism) or DNA microarray expression data can 
be represented in such a framework, with Aij representing the expression level of the gene or 
SNP in the experimental condition or individual. Similarly, term-document matrices can be 
constructed in many Internet applications, with A^j indicating the frequency of the term in 
the document. 

Most traditional algorithms for matrix problems are designed to run on a single machine, 
focusing on minimizing the number of floating-point operations per second (FLOPS). On the other 
hand, motivated by the ability to generate very large quantities of data in relatively automated 
ways, analyzing data sets of billions or more of records has now become a regular task in many 
companies and institutions. In a distributed computational environment, which is typical in 
these applications, communication costs, e.g., between different machines, are often much more 
important than computational costs. What is more, if the data cannot fit into memory on a 
single machine, then one must scan the records from secondary storage, e.g., hard disk, which 
makes each pass through the data associated with enormous I/O costs. Given that, in many of 
these large-scale applications, regression, low-rank approximation, and related matrix problems 
are ubiquitous, the fast computation of their solutions on large-scale data platforms is of interest. 

In this paper, we will provide an overview of recent work in Randomized Numerical Linear 
Algebra (RandNLA) on implementing randomized matrix algorithms in large-scale parallel and 
distributed computational environments. RandNLA is a large area that applies randomization as 
an algorithmic resource to develop improved algorithms for regression, low-rank matrix approx¬ 
imation, and related problems [T]. To limit the presentation, here we will be most interested in 
very large very rectangular linear regression problems on up to terabyte-sized data: in particular, 
in the £2 regression (also known as least squares, or LS) problem and its robust alternative, the ii 
regression (also known as least absolute deviations, LAD, or least absolute errors, LAE) problem, 
with strongly rectangular “tall” data. Although our main focus is on (.2 and £i regression, much 
of the underlying theory holds for ip regression, either for p G [1)2] or for all p G [1, 00 ), and thus 
for simplicity we formulate many of our results in ip. 

Several important conclusions will emerge from our presentation. 

• First, many of the basic ideas from RandNLA in RAM extend to RandNLA in parallel/dis¬ 
tributed environments in a relatively straightforward manner, assuming that one is more 
concerned about communication than computation. This is important from an algorithm 
design perspective, as it highlights which aspects of these RandNLA algorithms are pe¬ 
culiar to the use of randomization and which aspects are peculiar to parallel/distributed 
environments. 

• Second, with appropriate engineering of random sampling and random projection algo¬ 
rithms, it is possible to compute good approximate solutions—to low precision {e.g., 1 or 2 
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m 

n 

SNP 

number of SNPs (iC) 

number of subjects ( 10 ®) 

Tinylmages 

number of images ( 10 ®) 

number of pixels in each image ( 10 ®) 

PDF 

number of degrees of freedom 

number of time steps 

sensor network 

size of sensing data 

number of sensors 

NLP 

number of words and n-grams 

number of documents 

tick data 

number of ticks 

number of stocks 


Table 1: Examples of strongly rectangular datasets 


digits of precision), medium precision (e. 5 ., 3 or 4 digits of precision), or high precision {e.g., 
up to machine precision)—to several common matrix problems in only a few passes over the 
original matrix on up to terabyte-sized data. While low precision is certainly appropriate 
for many data analysis and machine learning applications involving noisy input data, the 
appropriate level of precision is a choice for user of an algorithm to make; and there are 
obvious advantages to having the developer of an algorithm provide control to the user on 
the quality of the answer returned by the algorithm. 

• Third, the design principles for developing high-quality RandNLA matrix algorithms de¬ 
pend strongly on whether one is interested in low, medium, or high precision. (An example 
of this is whether to solve the randomized subproblem with a traditional method or to 
use the randomized subproblem to create a preconditioned version of the original prob¬ 
lem.) Understanding these principles, the connections between them, and how they relate 
to traditional principles of NLA algorithm design is important for providing high-quality 
implementations of recent theoretical developments in the RandNLA literature. 

Although many of the ideas we will discuss can be extended to related matrix problems such 
as low-rank matrix approximation, there are two main reasons for restricting attention to strongly 
rectangular data. The first, most obvious, reason is that strongly rectangular data arises in many 
fields to which machine learning and data analysis methods are routinely applied. Consider, e.g., 
Table dl which lists a few examples. 

• In genetics, single nucleotide polymorphisms (SNPs) are important in the study of human 
health. There are roughly 10 million SNPs in the human genome. However, there are 
typically at most a few thousand subjects for a study of a certain type of disease, due to 
the high cost of determination of genotypes and limited number of target subjects. 

• In Internet applications, strongly rectangular datasets are common. For example, the image 
dataset called Tinylmages [2] which contains 80 million images of size 32 x 32 collected from 
Internet. 

• In spatial discretization of high-dimensional partial differential equations (PDFs), the num¬ 
ber of degrees of freedom grows exponentially as dimension increases. For 3D problems, 
it is common that the number of degrees of freedom reaches 10 ®, for example, by having 
a 1000 X 1000 X 1000 discretization of a cubic domain. However, for a time-dependent 
problem, time stays one-dimensional. Though depending on spatial discretization {e.g., 
the Courant-Friedrichs-Lewy condition for hyperbolic PDEs), the number of time steps is 
usually much less than the number of degrees of freedoms in spatial discretization. 

• In geophysical applications, especially in seismology, the number of sensors is much less 
than the number of data points each sensor collects. For example, Werner-Allen et al. [3] 
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deployed three wireless sensors to monitor volcanic eruptions. In 54 hours, each sensor sent 
back approximately 20 million packets. 


In natural language processing (NLP), the number of documents is much less than the 
number of n-grams, which grows geometrically as n increases. For example, the webspanfl 
dataset contains 350,000 documents and 254 unigrams, but 680,715 trigrams. 


• In high-frequency trading, the number of relevant stocks is much less than the number of 
ticks, changes to the best bid and ask. For example, in 2012 ISE Historical Options Tick 
Datso has daily files with average size greater than 100GB uncompressed. 

A second, less obvious, reason for restricting attention to strongly rectangular data is that many 
of the algorithmic methods that are developed for them (both the RandNLA methods we will 
review as well as deterministic NLA methods that have been used traditionally) have extensions 
to low-rank matrix approximation and to related problems on more general “fat” matrices. For 
example, many of the methods for SVD-based low-rank approximation and related rank-revealing 
QR decompositions of general matrices have strong connections to QR decomposition methods 
for rectangular matrices; and, similarly, many of the methods for more general linear and convex 
programming arise in special {e.g., ii regression) linear programming problems. Thus, they are 
a good problem class to consider the development of matrix algorithms (either in general or for 
RandNLA algorithms) in parallel and distributed environments. 

It is worth emphasizing that the phrase “parallel and distributed” can mean quite different 
things to different research communities, in particular to what might be termed HPC (high 
performance computing) or scientific computing researchers versus data analytics or database or 
distributed data systems researchers. There are important technical and cultural differences here, 
but there are also some important similarities. For example, to achieve parallelism, one can use 
multi-threading on a shared-memory machine, or one can use message passing on a multi-node 
cluster. Alternatively, to process massive data on large commodity clusters, Google’s MapReduce 
[4] describes a computational framework for distributed computation with fault tolerance. For 
computation not requiring any internode communication, one can achieve even better parallelism. 
We don’t want to dwell on many of these important details here: this is a complicated and evolving 
space; and no doubt the details of the implementation of many widely-used algorithms will evolve 
as the space evolves. To give the interested reader a quick sense of some of these issues, though, 
here we provide a very high-level representative description of parallel environments and how 
they scale. As one goes down this list, one tends to get larger and larger. 


name 

cores 

memory 

notes 

Shared memory 

Message passing 

MapReduce 
Distributed computing 

[10, 10^11 
[200, 10^11 
[40,10*5 
[-,3 X lO^ll 

[100GB,100TB] 

[1TB, 1000TB] 

[240GB, 100TB] 

GUDA cores: [5 x 10^,3 x 10®]^ 
GPU memory: [500GB, 20TB] 
storage: [100TB, lOOPB]^ 


Table 2: High-level representative description of parallel environments. 


Tttp://www.csie.ntu.edu.tw/-cjlin/libsvmtools/datasets/binary.html 

^http://www.ise.com/hotdata 

^http://www.sgi.com/pdfs/4358.pdf 

^http://www.top500.org/list/2011/11/100 

'http://i.top500.org/site/50310 
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In addition, it is also worth emphasizing that there is a great deal of related work in parallel 
and distributed computing, both in numerical linear algebra as well as more generally in scientific 
computing. For example. Valiant has provided a widely-used model for parallel computation [5]; 
Aggarwal et al. have analyzed the communication complexity of PRAMs [6] ; Lint and Agerwala 
have highlighted communication issues that arise in the design of parallel algorithms [7]; Heller 
has surveyed parallel algorithms in numerical linear algebra [8]; Toledo has provided a survey of 
out-of-core algorithms in numerical linear algebra [D]; Ballard et al. have focused on developing 
algorithms for minimizing communication in numerical linear algebra [10]; and Bertsekas and 
Tsitsiklis have surveyed parallel and distributed iterative algorithms m- We expect that some 
of the most interesting developments in upcoming years will involve coupling the ideas for imple¬ 
menting RandNLA algorithms in parallel and distributed environments that we describe in this 
review with these more traditional ideas for performing parallel and distributed computation. 

In the next section. Section O we will review the basic ideas underlying RandNLA methods, 
as they have been developed in the special case of £2 regression in the RAM model. Then, in 
Section [3l we will provide notation, some background and preliminaries on £2 and more general 
ip regression problems, as well as traditional methods for their solution. Then, in Section HI we 
will describe rounding and embedding methods that are used in a critical manner by RandNLA 
algorithms; and in Section [S] we will review recent empirical results on implementing these ideas 
to solve up to terabyte-sized £2 and ii regression problems. Finally, in Section [G] we will provide a 
brief discussion and conclusion. An overview of the general RandNLA area has been provided [1], 
and we refer the interested reader to this overview. In addition, two other reviews are available 
to the interested reader: an overview of how RandNLA methods can be coupled with traditional 
NLA algorithms for low-rank matrix approximation [T2|; and an overview of how data-oblivious 
subspace embedding methods are used in RandNLA m- 

2 RandNLA in RAM 

In this section, we will highlight several core ideas that have been central to prior work in 
RandNLA in (theory and/or practice in) RAM that we will see are also important as design 
principles for extending RandNLA methods to larger-scale parallel and distributed environments. 
We start in Section 12.11 by describing a prototypical example of a RandNLA algorithm for the 
very overdetermined LS problem; then we describe in Section [22] two problem-specihc complexity 
measures that are important for low-precision and high-precision solutions to matrix problems, re¬ 
spectively, as well as two complementary ways in which randomization can be used by RandNLA 
algorithms; and we conclude in Section [2.31 with a brief discussion of running time considerations. 

2.1 A meta-algorithm for RandNLA 

A prototypical example of the RandNLA approach is given by the following meta-algorithm for 
very overdetermined LS problems OHKiailSI. In particular, the problem of interest is to solve: 

min ||Ax — 6 II 2 . ( 1 ) 

X 

The following meta-algorithm takes as input an m x n matrix A, where m S> n, a vector 5, and 
a probability distribution and it returns as output an approximate solution x, which is 

an estimate of the exact answer x* * of Problem 

^http: //www. cloudera. com/blog/2010/04/pushing-the-limits-of-distributed-processing/ 

'http://hortonworks.com/blog/an-introduction-to-hdfs-federation/ 

*http://folding.stanford.edu/ 


5 



• Randomly sampling. Randomly sample r > n constraints, i.e., rows of A and the 

corresponding elements of h, using as an importance sampling distribution. 

• Subproblem construction. Rescale each sampled row/element by l/(r 7 rj) to form a 
weighted LS subproblem. 

• Solving the subproblem. Solve the weighted LS subproblem, formally given in ([ 2 ]) below, 
and then return the solution x. 

It is convenient to describe this meta-algorithm in terms of a random “sampling matrix” S, in the 
following manner. If we draw r samples (rows or constraints or data points) with replacement, 
then define an r x m sampling matrix, S, where each of the r rows of S has one non-zero element 
indicating which row of A (and element of b) is chosen in a given random trial. In this case, the 
(i, element of S equals Xj if the data point is chosen in the random trial (meaning, 
in particular, that every non-zero element of S equals -s/n/r for sampling uniformly at random). 
With this notation, this meta-algorithm constructs and solves the weighted LS estimator: 

X = argmin IIS'Ax — S' 6 || 2 . ( 2 ) 

X 

Since this meta-algorithm samples constraints and not variables, the dimensionality of the 
vector X that solves the (still over constrained, but smaller) weighted LS subproblem is the same 
as that of the vector x* that solves the original LS problem. The former may thus be taken as an 
approximation of the latter, where, of course, the quality of the approximation depends critically 
on the choice of Although uniform subsampling (with or without replacement) is very 

simple to implement, it is easy to construct examples where it will perform very poorly naEttS]. 
On the other hand, it has been shown that, for a parameter 7 G (0,1] that can be tuned, if 

vTj > 7 —, and r = 0{p\og{p)/ (3) 
P 

where the so-called statistical leverage scores ha are dehned in ([ 6 ]) below, i.e., if one draws the 
sample according to an importance sampling distribution that is proportional to the leverage 
scores of A, then with constant probability (that can be easily boosted to probability 1 — <5, for 
any 5 > 0 ) the following relative-error bounds hold: 

\\b-Ax \\2 < {I + e)\\b - Ax *\\2 and (4) 

lk*-x||2 < \/i (^K{A)y /^-2 _ ||a;*||2^ (5) 

where k{A) is the condition number of A and where ^ = ||RR'^ 6 || 2 /||ft ||2 is a parameter dehning 
the amount of the mass of b inside the column space of A MmM- 

Due to the crucial role of the statistical leverage scores in ([3]), this canonical RandNLA 
procedure has been referred to as the algorithmic leveraging approach to approximating LS ap¬ 
proximation m- In addition, although this meta-algorithm has been described here only for 
very overdetermined LS problems, it generalizes to other linear regression problems and low-rank 
matrix approximation problems on less rectangular matrice^ [T71 HU [T9l [ 20 l [ 2 T] . 

®Let j 4 be a matrix with dimension m by n where m > n. A less rectangular matrix is a matrix that has smaller 
m/n. 
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2.2 Leveraging, conditioning, and nsing randomization 

Leveraging and conditioning refer to two types of problem-specific complexity measures, Le., 
quantities that can be computed for any problem instance that characterize how difficult that 
problem instance is for a particular class of algorithms. Understanding these, as well as different 
uses of randomization in algorithm design, is important for designing RandNLA algorithms, both 
in theory and/or practice in RAM as well as in larger parallel and distributed environments. For 
now, we describe these in the context of very overdetermined LS problems. 

• Statistical leverage. (Related to eigenvectors; important for obtaining low-precision so¬ 
lutions.) If we let H = A[A^A)~^A^, where the inverse can be replaced with the Moore- 
Penrose pseudoinverse if A is rank deficient, be the projection matrix onto the column span 
of A, then the diagonal element of H, 

hii = A(j^{A^A) ( 6 ) 

where A(j) is the row of A, is the statistical leverage of observation or sample. Since 
H can alternatively be expressed as H = UU'^, where U is any orthogonal basis for the 
column space of X, e.g., the Q matrix from a QR decomposition or the matrix of left singular 
vectors from the thin SVD, the leverage of the i^^ observation can also be expressed as 

n 

= = ( 7 ) 

i=i 

where 17(j) is the row of U. Leverage scores provide a notion of “coherence” or “outlier- 
ness,” in that they measure how well-correlated the singular vectors are with the canonical 
basis [THl ESI |22] as well as which rows/constraints have largest “influence” on the LS 
fit [23l 1231 [251126] . Computing the leverage scores {fin}™! exactly is generally as hard as 
solving the original LS problem (but 1 ± e approximations to them can be computed more 
quickly, for arbitrary input matrices |15jL 

Leverage scores are important from an algorithm design perspective since they define the 
key nonuniformity structure needed to control the complexity of high-quality random sam¬ 
pling algorithms. In particular, naive uniform random sampling algorithms perform poorly 
when the leverage scores are very nonuniform, while randomly sampling in a manner that 
depends on the leverage scores leads to high-quality solutions. Thus, in designing RandNLA 
algorithms, whether in RAM or in parallel-distributed environments, one must either quickly 
compute approximations to the leverage scores or quickly preprocess the input matrix so they 
are nearly uniformized—in which case uniform random sampling on the preprocessed matrix 
performs well. 

Informally, the leverage scores characterize where in the high-dimensional Euclidean space 
the (singular value) information in A is being sent, i.e., how the quadratic well (with aspect 
ratio k(A) that is implicitly defined by the matrix A) “sits” with respect to the canonical 
axes of the high-dimensional Euclidean space. If one is interested in obtaining low-precision 
solutions, e.g., e = 10“^, that can be obtained by an algorithm that provides 1 ± e relative- 
error approximations for a fixed value of e but whose e dependence is polynomial inlje, then 
the key quantities that must be dealt with are statistical leverage scores of the input data. 

• Condition number. (Related to eigenvalues; important for obtaining high-precision so¬ 
lutions.) If we let (Tmax(A) and (Tmin(A) denote the largest and smallest nonzero singular 
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values of A, respectively, then k{A) = UmaxC^)/(^) is the ^ 2 -norm condition number 
of A which is formally defined in Definition [3l Computing k{A) exactly is generally as 
hard as solving the original LS problem. The condition number n{A) is important from an 
algorithm design perspective since k,{A) defines the key nonuniformity structure needed to 
control the complexity of high-precision iterative algorithms, i.e., it bounds the number of 
iterations needed for iterative methods to converge. In particular, for ill-conditioned prob¬ 
lems, e.g., if k{A) k, 10® ^ 1, then the convergence speed of iterative methods is very slow, 
while if K > 1 then iterative algorithms converge very quickly. Informally, k,{A) defines the 
aspect ratio of the quadratic well implicitly defined by A in the high-dimensional Euclidean 
space. If one is interested in obtaining high-precision solutions, e.g., e = 10“^®, that can he 
obtained by iterating a low-precision solution to high precision with an iterative algorithm 
that converges as log(l/e), then the key quantity that must he dealt with is the condition 
number of the input data. 

• Monte Carlo versus Las Vegas uses of randomization. Note that the guarantee 
provided by the meta-algorithm, as stated above, is of the following form; the algorithm runs 
in no more than a specified time T, and with probability at least 1 — d it returns a solution 
that is an e-good approximation to the exact solution. Randomized algorithms that provide 
guarantees of this form, i.e., with running time that is is deterministic, but whose output 
may be incorrect with a certain small probability, are known as Monte Carlo algorithms m- 
A related class of randomized algorithms, known as Las Vegas algorithms, provide a different 
type of guaranatee: they always produce the correct answer, but the amount of time they 
take varies randomly m- In many applications of RandNLA algorithms, guarantees of this 
latter form are preferable. 

The notions of condition number and leverage scores have been described here only for very 
overdetermined £2 regression problems. However, as discussed in Section [3] below (as well as 
previously [niE9]), these notions generalize to very overdetermined ip, for p ^ 2, regression 
problems m as well as to p = 2 for less rectangular matrices, as long as one specihes a rank 
parameter k m- Understanding these generalizations, as well as the associated tradeoffs, will be 
important for developing RandNLA algorithms in parallel and distributed environments. 

2.3 Running Time Considerations in RAM 

As presented, the meta-algorithm of the previous subsection has a running time that depends 
on both the time to construct the probability distribution, {'n'i}2=i, and the time to solve the 
subsampled problem. For uniform sampling, the former is trivial and the latter depends on the 
size of the subproblem. For estimators that depend on the exact or approximate (recall the 
flexibility in ([3|) provided by 7 ) leverage scores, the running time is dominated by the exact or 
approximate computation of those scores. A naive algorithm involves using a QR decomposition 
or the thin SVD of A to obtain the exact leverage scores. This naive implementation of the meta¬ 
algorithm takes roughly 0{mn‘^/e) time and is thus no faster (in the RAM model) than solving 
the original LS problem exactly [acz]. There are two other potential problems with practical 
implementations of the meta-algorithm: the running time dependence of roughly 0 (mn^/e) time 
scales polynomially with I/e, which is prohibitive if one is interested in moderately small {e.g., 
10~^) to very small {e.g., 10“^®) values of e; and, since this is a randomized Monte Carlo algorithm, 
with some probability 6 the algorithm might completely fail. 

Importantly, all three of these potential problems can be solved to yield improved variants of 
the meta-algorithm. 
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• Making the algorithm fast: improving the dependence on m and n. We can make 
this meta-algorithm “fast” in worst-case theory in RAM |28l 1291 [T^ 1151120] , In particular, 
this meta-algorithm runs in O {mn log n/e) time in RAM if one does either of the following: 
if one performs a Hadamard-based random random projection and then performs uniform 
sampling in the randomly rotated basis [281129| (which, recall, is basically what random 
projection algorithms do when applied to vectors in a Euclidean space ID); or if one quickly 
computes approximations to the statistical leverage scores (using the algorithm of [15] . 
the running time bottleneck of which is applying a random projection to the input data) 
and then uses those approximate scores as an importance sampling distribution OCS]. 
In addition, by using carefully-constructed extremely-sparse random projections, both of 
these two approaches can be made to run in so-called “input sparsity time,” i.e., in time 
proportional to the number of nonzeros in the input data, plus lower-order terms that 
depend on the lower dimension of the input matrix [20]. 

• Making the algorithm high-precision: improving the dependence on e. We can 

make this meta-algorithm “fast” in practice, e.g., in “high precision” numerical implementa¬ 
tion in RAM |30ll31ll32ll33j . In particular, this meta-algorithm runs in 0(mn log n log (1/e)) 
time in RAM if one uses the subsampled problem constructed by the random projec- 
tion/sampling process to construct a preconditioner, using it as a preconditioner for a 
traditional iterative algorithm on the original full problem [301131[ 1^. This is important 
since, although the worst-case theory holds for any fixed e, it is quite coarse in the sense 
that the sampling complexity depends on e as 1/e and not log(l/e). In particular, this 
means that obtaining high-precision with (say) e = is not practically possible. In this 

iterative use case, there are several tradeoffs: e.g., one could construct a very high-quality 
preconditioner {e.g., using a number of samples that would yield a l-|-e error approximation 
if one solved the LS problem on the subproblem) and perform fewer iterations, or one could 
construct a lower quality preconditioner by drawing many fewer samples and perform a 
few extra iterations. Here too, the input sparsity time algorithm of [20| could be used to 
improve the running time still further. 

• Dealing with the 6 failure probability. Although fixing a failure probability 5 is con¬ 
venient for theoretical analysis, in certain applications having even a very small probability 
that the algorithm might return a completely meaningless answer is undesirable. In this 
case, one is interested in converting a Monte Carlo algorithm into a Las Vegas algorithm. 
Fortuitously, those application areas, e.g., scientific computing, are often more interested 
in moderate to high precision solutions than in low precision solutions. In these case, using 
the subsampled problem to create a preconditioner for iterative algorithms on the original 
problem has the side effect that one changes a “fixed running time but might fail” algorithm 
to an “expected running time but will never fail” algorithm. 

From above, we can make the following conclusions. The “fast in worst-case theory” variants 
of our meta-algorithm ([281 isg da laiini) represent qualitative improvements to the 0{mv?) 
worst-case asymptotic running time of traditional algorithms for the LS problem going back to 
Gaussian elimination. The “fast in numerical implementation” variants of the meta-algorithm 
([30l m [32]) have been shown to beat Lapack’s direct dense least-squares solver by a large 
margin on essentially any dense tall matrix, illustrating that the worst-case asymptotic theory 
holds for matrices as small as several thousand by several hundred m- 

While these results are a remarkable success for RandNLA in RAM, they leave open the ques¬ 
tion of how these RandNLA methods perform in larger-scale parallel/distributed environments, 
and they raise the question of whether the same RandNLA principles can be extended to other 
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common regression problems. In the remainder of this paper, we will review recent work showing 
that if one wants to solve I 2 regression problems in parallel/distributed environments, and if one 
wants to solve l\ regression problems in theory or in RAM or in parallel/distributed environ¬ 
ments, then one can use the same RandNLA meta-algorithm and design principles. Importantly, 
though, depending on the exact situation, one must instantiate the same algorithmic principles 
in different ways, e.g., one must worry much more about communication rather than FLOPS. 

3 Preliminaries on regression problems 

In this section, we will start in Section [3d] with a brief review of notation that we will use in 
the remainder of the paper. Then, in Sections 13.21 13.31 and 13.41 we will review 1^ regression 
problems and the notions of condition number and preconditioning for these problems. Finally, 
in Sections 13.51 and 13.61 we will review traditional deterministic solvers for £2 as well as and 
more general l-p regression problems. 

3.1 Notation conventions 

We briefly list the notation conventions we follow in this work: 

• We use uppercase letters to denote matrices and constants, e.g.. A, R, C, etc. 

• We use lowercase letters to denote vectors and scalars, e.g., x, b, p, m, n, etc. 

• We use II • lip to denote the Ip norm of a vector, || • II2 the spectral norm of a matrix, || • \\f 
the Frobenius norm of a matrix, and | • |p the element-wise ip norm of a matrix. 

• We use uppercase calligraphic letters to denote point sets, e.g., A for the linear subspace 
spanned by A’s columns, C for a convex set, and £ for an ellipsoid, except that O is used 
for big 0-notation. 

• The accent is used for sketches of matrices, e.g., A, the superscript is used for 
indicating optimal solutions, e.g., x*, and the accent is used for estimates of solutions, 
e.g., X. 

3.2 ip regression problems 

In this work, a parameterized family of linear regression problems that is of particular interest is 
the ip regression problem. 

Definition 1 {ip regression) Given a matrix A G a vector b G and p G [l,oo], the 

ip regression problem specified by A, b, and p is the following optimization problem: 

minimize^^gK" \\Ax - 6||p, (8) 

where the ip norm of a vector x is ||x||p = \xif’)^^'^, defined to be maxj \xi\ for p = 00. We 

call the problem strongly over-determined if n, and strongly under-determined if m-^n. 

Important special cases include the ^2 regression problem, also known as linear least squares 
(LS), and the ii regression problem, also known as least absolute deviations (LAD) or least 
absolute errors (LAE). The former is ubiquitous; and the latter is of particular interest as a 
robust regression technique, in that it is less sensitive to the presence of outliers than the former. 
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For general p G [l,oo], denote A!* the set of optimal solntions to ([8]). Let x* € X* be an 
arbitrary optimal solution, and let f* = \\Ax* — b\\p be the optimal objective value. We will be 
particularly interested in hnding a relative-error approximation, in terms of the objective value, 
to the general Ip regression problem Q. 

Definition 2 (Relative-error approximation) Given an error parameter e > 0, x £ MA is a 
(1 + e)-approximate solution to the ip regression problem ([8]) if and only if 

f = \\Ax - b\\p < (1 + e)/*. 

In order to make our theory and our algorithms for general ip regression simpler more concise, 
we can use an equivalent formulation of ([8|) in our discussion. 

minimize3,6]Rn ||Ax||p 

rj-i \ ^/ 

subject to c X = 1. 

Above, the “new” A is A concatenated with —5, i.e., (A —5) and c is a vector with a 1 at 
the last coordinate and zeros elsewhere, i.e., c € and c = (O ... 0 l), to force the last 

element of any feasible solution to be 1. We note that the same formulation is also used by [34] for 
solving unconstrained convex problems in relative scale. This formulation of ip regression, which 
consists of a homogeneous objective and an affine constraint, can be shown to be equivalent to 
the formulation of (|8|). 

Consider, next, the special case p = 2. If, in the LS problem 

minimizea;6Rn ||Ax - 6||2, (10) 

we let r = rank(A) < min(m,n), then recall that if r < n (the LS problem is under-determined 
or rank-deficient), then (|10l) has an infinite number of minimizers. In that case, the set of all 
minimizers is convex and hence has a unique element having minimum length. On the other 
hand, if r = n so the problem has full rank, there exists only one minimizer to (llOh and hence 
it must have the minimum length. In either case, we denote this unique min-length solution to 
(|10p by X* , and we are interested in computing x* in this work. This was dehned in Problem ([T]) 
above. In this case, we will also be interested in bounding ||x* — x||2, for arbitrary or worst-case 
input, where x was dehned in Problem ([2]) above and is an approximation to x*. 

3.3 ip-norm condition number 

An important concept in £2 and more general ip regression problems, and in developing efficient 
algorithms for their solution, is the concept of condition number. For linear systems and LS 
problems, the £2-iiorm condition number is already a well-established term. 

Definition 3 (^ 2 -norm condition nnmber) Given a matrix A £ with full column rank, 

let (T™®'^(A) be the largest singular value and (T™™(A) be the smallest singular value of A. The 
^2-norm condition number of A is defined as K 2 (A) = (T™'^^(A)/it™™(A). For simplicity, we use 
H 2 , and when the underlying matrix is clear from context. 

For general ip norm and general ip regression problems, here we state here two related notions of 
condition number and then a lemma that characterizes the relationship between them. 
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Definition 4 (fp-norm condition number (Clarkson et al. [19] ii Given a matrix A G 
and p G [1, oo], let 

(T™’“(j 4) = max IIAxIL and = min || 74 x|L. 

^ lb||2 = l ^ P V / |ix||2 = l 

T/ien, we denote by Kp{A) the t'p-norm condition number of A, defined to be: 

Kp{A) = a^^^A)/af-{A). 

For simplicity, we use Kp, a™™, and when the underlying matrix is clear. 

Definition 5 ((a, /3,p)-conditioning (Dasgupta et al. [3 5| )) Given a matrix A G and 

p G [l,oo], let II • llg be the dual norm of || • ||p. Then A is (a,/3,p)-conditioned if (1) |^|p < a, 
and (2) for all z G M"', || 2 ;||q < /3||j42;||p. Define Kp{A), the (a,/3,p)-condition number of A, as 
the minimum value of afi such that A is {a, fi,p)-conditioned. We use Rp for simplicity if the 
underlying matrix is clear. 

Lemma 1 (Equivalence of Kp and Rp (Clarkson et al. [T9| ii Given a matrix A G 
and p £ [l,oo], we always have 

n-|i/2-i/PlKp(^) < Rp{A) < 

That is, by Lemma [H if m S> n, then the notions of condition number provided by Definition 0] 
and Definition 0 are equivalent, up to low-dimensional factors. These low-dimensional factors 
typically do not matter in theoretical formulations of the problem, but they can matter in practical 
implement at ions. 

The ^p-norm condition number of a matrix can be arbitrarily large. Given the equivalence 
established by Lemma [U we say that a matrix A is well-conditioned in the ip norm if Kp or 
Rp = 0(poly(n)), independent of the high dimension m. We see in the following sections that the 
condition number plays a very important part in the analysis of traditional algorithms. 

3.4 Preconditioning ip regression problems 

Preconditioning refers to the application of a transformation, called the preconditioner, to a 
given problem instance such that the transformed instance is more-easily solved by a given class 
of algorithms. Most commonly, the preconditioned problem is solved with an iterative algorithm, 
the complexity of which depends on the condition number of the preconditioned problem. 

To start, consider p = 2, and recall that for a square linear system Ax = 6 of full rank, this 
preconditioning usually takes one of the following forms: 

left preconditioning M^Ax = M^b, 
right preconditioning ANy = b, x = Ny, 
left and right preconditioning M^ANy = M^b, x = Ny. 

Clearly, the preconditioned system is consistent with the original one, i.e., has the same x* as 
the unique solution, if the preconditioners M and N are nonsingular. 

For the general LS Problem ([1]), more care should be taken so that the preconditioned system 
has the same min-length solution as the original one. In particular, if we apply left preconditioning 
to the LS problem mlua, || Ax— 6 || 2 , then the preconditioned system becomes min^; ||M^Ax—M^ 6 || 2 , 
and its min-length solution is given by 

xfeft = (M^A)tM^6. 
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Similarly, the min-length solution to the right preconditioned system is given by 

= N{AN)h. 

The following lemma states the necessary and sufficient conditions for = N{AN)^ or = 
(M^A)^M'^ to hold. Note that these conditions holding certainly imply that = x* and 

^left ~ respectively. 

Lemma 2 (Left and right preconditioning (Meng et al. [32]) Given A G N G 

and M G we have 

1. A^ = N{AN)^ if and only if range{NN^A^) = range{A^), 

2. = {M^A)^M^ if and only if range{MM^A) = range{A). 

Given this preconditioned problem, m (see below) bounds the number of itrations for certain 
iterative algorithms for the LS problem. 

Just as with p = 2, for more general ip regression problems with matrix A G with full 

column rank, although its condition numbers Kp{A) and Rp{A) can be arbitrarily large, we can 
often find a matrix R G such that AR~^ is well-conditioned. (This is not the R from a 

QR decomposition of A, unless p = 2, but some other matrix R.) In this case, the ip regression 
Problem ([9]) is equivalent to the following well-conditioned problem: 

minimizeygR>^ \\AR~^y\\p, 

subject to c R~ y = 1- 

Clearly, if y* is an optimal solution to m, then X* = R is an optimal solution to ([9]), and 
vice versa; however, (jlip may be easier to solve than ([9]) because of better conditioning. 

Since we want to reduce the condition number of a problem instance via preconditioning, it 
is natural to ask what the best possible outcome would be in theory. For p = 2, an orthogonal 
matrix, e.g., the matrix Q computed from a QR decomposition, has K 2 {Q) = 1- More generally, 
for the fp-norm condition number Kp, we have the following existence result. 

Lemma 3 Given a matrix A G with full column rank and p G [l,oo], there exist a matrix 

R G such that Kp{AR~^) < 

This is a direct consequence of John’s theorem [36] on ellipsoidal rounding of centrally symmetric 
convex sets. For the (a,/3,p)-condition number kp, we have the following lemma. 

Lemma 4 Given a matrix A G R™^”' with full column rank and p G [l,oo], there exist a matrix 
R G R”^” such that Kp{AR~^) < n. 

Note that Lemmas [3] and 0] are both existential results. Unfortunately, except the case when 
p = 2, no polynomial-time algorithm is known that can provide such preconditioning for general 
matrices. Below, in Section 01 we will discuss two practical approaches for fp-norm precondition¬ 
ing: via ellipsoidal rounding and via subspace embedding, as well as subspace-preserving sampling 
algorithms built on top of them. 

3.5 Traditional solvers for £2 regression 

Least squares is a classic problem in linear algebra. It has a long history, tracing back to Gauss, 
and it arises in numerous applications. A detailed survey of numerical algorithms for least squares 
is certainly beyond the scope of this work. In this section, we briefly describe some well-known 
direct methods and iterative methods that compute the min-length solution to a possibly rank- 
dehcient least squares problem, and we refer readers to Bjorck m for additional details. 
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Direct methods 

It is well known that the min-length solution of a least squares problem can be computed using 
the singular value decomposition (SVD). Let A = UYiV'^ be the compact SVD, where U G 
S € and V € R”'^'’, i.e., only singular vectors corresponding to the non-zero singular 

values are calculated. We have x* = VT,~^U^b. The matrix is the Moore-Penrose 

pseudoinverse of A, denoted by A^, which is defined and unique for any matrix. Hence we can 
simply write x* = The SVD approach is accurate and robust to rank-deficiency. 

Another way to solve a least squares problem is using complete orthogonal factorization. If 
we can find orthonormal matrices Q G R”*^’' and Z G R”'^'’, and a matrix T G R'"^’’, such that 
A = QTZ^, then the min-length solution is given by x* = ZT~^Q^b. We can treat SVD as a 
special case of complete orthogonal factorization. In practice, complete orthogonal factorization 
is usually computed via rank-revealing QR factorizations, making T a triangular matrix. The 
QR approach is less expensive than SVD, but it is slightly less robust at determining the rank. 

A third way to solve a least squares problem is by computing the min-length solution to the 
normal equation AAAx = AAb, namely 

X* = {A^A)^A^b = A^{AA^)h. (12) 

It is easy to verify the correctness of (I12p by replacing A by its compact SVD If r = 

min(m, n), a Cholesky factorization of either AAA (if m > n) or AAA (if m < n) solves (jl2p . If 
r < min(m, n), we need the eigensystem of AAA or AAA to compute x*. The normal equation 
approach is the least expensive among the three direct approaches we have mentioned, but it is 
also the least accurate one, especially on ill-conditioned problems. See Chapter 5 of Golub and 
Van Loan |38] for a detailed analysis. A closely related direct solver is the semi-normal equation 
method. It is often useful when the R-factor of the QR decomposition is known; see [39] for more 
details. 

For sparse least squares problems, by pivoting A’s columns and rows, we may find a sparse 
factorization of A, which is preferred to a dense factorization for more efficient storage. For sparse 
direct methods, we refer readers to Davis [40] . 


Iterative methods 


Instead of direct methods, we can use iterative methods to solve (flop . If all the iterates are 

in range(A^) and if converges to a minimizer, it must be the minimizer having minimum 

length, i.e., the solution to Problem ([T|). This is the case when we use a Krylov subspace method 
starting with a zero vector. For example, the conjugate gradient (CG) method on the normal 
equation leads to the min-length solution (see Paige and Saunders m)- In practice, CGLS [42] , 
LSQR [l3j are preferable because they are equivalent to applying CG to the normal equation in 
exact arithmetic but they are numerically more stable. Other Krylov subspace methods such as 
LSMR [44] can also solve (jiop as well. The Chebyshev semi-iterative method [45] can also be 
modified to solve LS problems. 

Importantly, however, it is in general hard to predict the number of iterations for CG-like 
methods. The convergence rate is affected by the condition number of A. A classical result 
[46] p.l87] states that 
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where ||-2 ||^t^ = A'Az = ||Az|p for any z G R”', and where A) is the condition number of 
A A under the 2-norm. Estimating k{A A) is generally as hard as solving the LS problem itself, 
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and in practice the bound does not hold in any case unless reorthogonalization is used. Thus, 
the computational cost of CG-like methods remains unpredictable in general, except when A 
is very well-conditioned and the condition number can be well estimated. 


3.6 Traditional solvers for ii and more general Ip regression 

While (.2 regression can be solved with direct methods such as SVD and QR, the solution of 
general (.p regression has to rely on iterative methods due to the lack of analytical solution. In 
particular, and ^oo regression problems can be formulated as linear programs and solved by 
linear programming solvers, and general ^p regression problems can be formulated as convex pro¬ 
grams and hence solved by general convex solvers. This, however, comes at the cost of increased 
complexity, compared to the (.2 case. For example, it is easy to see that all Ip regression prob¬ 
lems are convex due to the convexity of vector norms. Therefore, standard convex solvers, e.g., 
gradient-based methods interior-point methods (IPMs) [48] . and interior-point cutting-plane 
methods lIPCPMsi|49] can be used to solve ip regression problems. Discussing those convex 
solvers is beyond the scope of the work. We refer readers to the monographs mentioned above or 
Boyd and Vandenberghe |50] for a general introduction. 

When p = 1 or 00 , the problem is still convex but not smooth. Subgradient methods [5l] 
or gradient methods with smoothing |52] can be used to handle non-smoothness, while another 
solution is via linear programming. In particular, an ii regression problem specified by T G 
and b G is equivalent to the following linear program: 

minimize lm2/+ + ^mV— 
subject to Ax — 5 = ?/+ — y_, 

y+, y- > 0, y+,y- e M™, x g M”, 

and an ioo regression problem specified by A and b is equivalent to the following: 


minimize y 

subject to — y < Ax — b < y, 
y G M, X G M”, 


where Im £ 1^"* indicates a vector of length m with all ones. As a linear programming problem, 
an ii or i^o regression problem can be solved by any linear programming solver, using the simplex 
method [53] or IPMs. Similar to the case for least squares, the ip condition number affects the 
performance of ip regression solvers, e.g., on the convergence rate for subgradient [5T] or gradient 
method [54], on the search of an initial feasible point for IPMs [55], and on the initial search region 
for ellipsoid methods and IPCPMs [l9]. Generally speaking, a smaller ip condition number makes 
the problem easier to solve. 

Another popular way to solve ip regression problems is via iteratively re-weighted least squares 
(IRLS) [56], which solves a sequence of weighted least squares problems and makes the solutions 
converge to an optimal solution of the original ip regression problem. At step k, it solves the 
following weighted least squares problem: 

^(fc+i) _ ^j.g IIIT^^^fAx — 6)||2, 


where is a diagonal matrix with positive diagonals wf^\ i = l,...,m. Let be an 

identity matrix and choose 



iTx^^^-bAP-\ 


i = 1,... ,m, k = 1,.... 
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until converges. The choice of is often smoothed to avoid dividing by zero in practice. 

It is not hard to show that if converges, it converges to an optimal solution of the £p regres¬ 

sion problem. However, the convergence theory of IRLS only exists under certain assumptions 
and the convergence rate is much harder to derive. See Burrus |57] for a survey of related work. 

4 Rounding, embedding, and sampling ip regression problems 

Preconditioning, ellipsoidal rounding, and low-distortion subspace embedding are three core tech¬ 
nical tools underlying RandNLA regression algorithms. In this section, we will describe in detail 
how these methods are used for Ip regression problems, with an emphasis on tradeoffs that arise 
when applying these methods in parallel and distributed environments. Recall that, for any ma¬ 
trix A € with full column rank, Lemmas [3] and H] above show that there always exists a 

preconditioner matrix R E such that AR~^ is well-conditioned, for Ip regression, for general 

p E [1, oo]. For p = 2, such a matrix R can be computed in 0{m'n?) time as the “R” matrix from 
a QR decomposition, although it is of interest to compute other such preconditioner matrices R 
that are nearly as good more quickly; and for p = 1 and other values of p, it is of interest to 
compute a preconditioner matrix R in time that is linear in m and low-degree polynomial in n. 
In this section, we will discuss these and related issues. 

In particular, in Sections 14.II and 14.21 we discuss practical algorithms to find such R matrices, 
and we describe the trade-offs between speed {e.g., FLOPS, number of passes, additional space/- 
time, etc.) and conditioning quality. The algorithms fall into two general families: ellipsoidal 
rounding fSection l4.ip and subspace embedding fSection l4.2p . We present them roughly in the or¬ 
der of speed (in the RAM model), from slower ones to faster ones. We will discuss practical trade¬ 
offs in Section m For simplicity, here we assume m ^ poly(n), and hence mn -|- poly(n); 

and if A is sparse, we assume that mn ^ nnz(A). Hereby, the degree of poly(n) depends on the 
underlying algorithm, which may range from 0{n) to 

Before diving into the details, it is worth mentioning a few high-level considerations about 
subspace embedding methods. (Similar considerations apply to ellipsoidal rounding methods.) 
Subspace embedding algorithms involve mapping data points, e.g., the columns of an m x n 
matrix, where m n to a lower-dimensional space such that some property of the data, e.g., 
geometric properties of the point set, is approximately preserved; see Definition [7] for definition 
for low-distortion subspace embedding matrix. As such, they are critical building blocks for 
developing improved random sampling and random projection algorithms for common linear 
algebra problems more generally, and they are one of the main technical tools for RandNLA 
algorithms. There are several properties of subspace embedding algorithms that are important in 
order to optimize their performance in theory and/or in practice. For example, given a subspace 
embedding algorithm, we may want to know: 

• whether it is data-oblivious {i.e., independent of the input subspace) or data-aware {i.e., 
dependent on some property of the input matrix or input space), 

• the time and storage it needs to construct an embedding, 

• the time and storage to apply the embedding to an input matrix, 

• the failure rate, if the construction of the embedding is randomized, 

• the dimension of the embedding, i.e., the number of dimensions being sampled by sampling 
algorithms or being projected onto by projection algorithms, 

• the distortion of the embedding, and 
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scores approximation 


Figure 1: Overview of relationships between several core technical components in RandNLA 
algorithms for solving ip regression. Relevant subsection and tables in this section are also shown. 
A directed edge implies the tail component contribntes to the head component. 


• how to balance the trade-offs among those properties. 

Some of these considerations may not be important for typical theoretical analysis but still affect 
the practical performance of implementations of these algorithms. 

After the discnssion of rounding and embedding methods, we will then show in Section 14.31 
that ellipsoidal rounding and subspace embedding methods (that show that the ip norms of the 
entire subspace of vectors can be well-preserved) can be used in one of two complementary ways: 
one can solve an ip regression problem on the rounded/embedded subproblem; or one can use the 
rounding/embedding to construct a preconditioner for the original problem. (We loosely refer 
to these two complementary types of approaches as low-precision methods and high-precision 
methods, respectively. The reason is that the running time complexity with respect to the error 
parameter e for the former is poly(l/e), while the running time complexity with respect to e for 
the latter is log(l/e).) We also discuss various ways to combine these two types of approaches to 
improve their performance in practice. 

Since we will introduce several important and distinct but closely-related concepts in this 
long section, in Figured] we provide an overview of these relations as well as of the structure of 
this section. 

4.1 Ellipsoidal rounding and fast ellipsoid rounding 

In this subsection, we will describe ellipsoidal rounding methods. In particular, we are interested 
in the ellipsoidal rounding of a centrally symmetric convex set and its application to I'p-norm 
preconditioning. We start with a definition. 

Definition 6 (Ellipsoidal ronnding) Let C C M” he a convex set that is full-dimensional, 
closed, bounded, and centrally symmetric with respect to the origin. An ellipsoid £{0,E) = {x E 
M” I ||Ex||2 < 1} is a K-rounding of C if it satisfies £f k C C C £, for some k > 1, where £Ik 
means shrinking £ by a factor of Ifn. 
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K 

time 

# passes 

# calls to oracle 

ER [5I1135] 

(n(n + 1 ))^/^ 

0{mn^ logm) 

log m 

0{n^ logm) 

Fast ER [19] 

2 n 

0{mn^ logm) 

nlogm 

0{n^ logm) 

Single-pass ER [U2] 

2 nl2/p-i|+i 

0{mn^ logm) 

1 

0{jn? logm)* 


Table 3: Summary of several ellipsoidal rounding for ip conditioning. Above, the * superscript 
denotes that the oracles are described and called through a smaller matrix with size m/n by n. 


Finding an ellipsoidal rounding with a small n factor for a given convex set has many appli¬ 
cations such as in computational geometry [58], convex optimization [59], and computer graph¬ 
ics |60j . In addition, the .^p-norm condition number Up naturally connects to ellipsoidal round¬ 
ing. To see this, let C = {x G M” | ||Ax||p < 1} and assume that we have a K-rounding of C: 
£ = {x\ ||i?x ||2 < 1}. This implies 

||^2:||2 < ||Ax||p < Ac||iix||2, VxGM”. 

If we let y = Rx, then we get 

llylb < \\AR-^y\\p < K\\y\\2, Vy G M” . 

Therefore, we have Kp{AR~^) < k. So a K-rounding of C leads to a ^-preconditioning of A. 

Recall the well-known result due to John [36] that for a centrally symmetric convex set C there 
exists a n^/^-rounding. It is known that this result is sharp and that such rounding is given by 
the Lowner-John (LJ) ellipsoid of C, i.e., the minimal-volume ellipsoid containing C. This leads 
to Lemma [3] above. Unfortunately, finding an n^/^-rounding is a hard problem. No constant- 
factor approximation in polynomial time is known for general centrally symmetric convex sets, 
and hardness results have been shown |59] . 

To state algorithmic results, suppose that C is described by a separation oracle and that we 
are provided an ellipsoid £q that gives an L-rounding for some L > 1. In this case, we can find 
a (n(n -|- l))^/^-rounding in polynomial time, in particular, in O(n^logL) calls to the oracle; see 
Lovasz [591 Theorem 2.4.1]. (Polynomial time algorithms with better k have been proposed for 
special convex sets, e.g., the convex hull of a finite point set [HT] and the convex set specified 
by the matrix i^o norm |54j.l This algorithmic result was used by Clarkson m and then by 
Dasgupta et al. [35| for ip regression. Note that, in these works, only 0(n)-rounding is actually 
needed, instead of (n(n -|- l))^/^-rounding. 

Recent work has focused on constructing ellipsoid rounding methods that are much faster 
than these more classical techniques but that lead to only slight degredation in preconditioning 
quality. See Table [3] for a summary of these results. In particular, Clarkson et al. [19] follow the 
same construction as in the proof of Lovasz [59] but show that it is much faster (in O(re^logL) 
calls to the oracle) to find a (slightly worse) 2 n-rounding of a centrally symmetric convex set in 
M"" that is described by a separation oracle. 

Lemma 5 (Fast ellipsoidal rounding (Clarkson et al. (19| I) Given a centrally symmetric 
convex set C C R”, which is centered at the origin and described by a separation oracle, and an 
ellipsoid £q centered at the origin such that £olL <Z C £q for some L > 1, it takes at most 
3.15n^logL calls to the oracle and additional 0{n^ log L) time to find a 2n-rounding ofC. 

By applying Lemma [5| to the convex set C = {x | ||Ax||p < 1}, with the separation oracle de¬ 
scribed via a subgradient of ||Ax||p and the initial rounding provided by the “R” matrix from the 
QR decomposition of A, one immediately improves the running time of the algorithm used by 
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Algorithm 1 A single-pass conditioning algorithm. 

Input: A G with full column rank and p € [1, oo]. 

Output: A non-singular matrix E G such that 


\\y\\2<\\AEy\\p<2n^^/P-^^+%\\2, Vy G 


1 : Partition A along its rows into sub-matrices of size n? x n, denoted by Ai,..., Am- 
2 : For each Ai, compute its economy-sized SVD: Ai = UiEiV^. 

3: Let Ai = for i = 1,..., M, 


C = 


X G 


M \^/P 

11^*^112) 



and A 



4: Compute i’s SVD: A = UtV^. 

5: Let £0 = £{0,Eo) where Eq = n“'^x{i/p-i/ 2 ,o}y^-i^ 

6: Compute an ellipsoid £ = £{0,E) that gives a 2n-rounding of C starting from £q that gives 
an (Mn^)l^/^“^/^l-rounding of C. 

7 - Return nmin{i/p-i/2,o}^_ 


Clarkson m and by Dasgupta et al. [35] from 0{mn^ log m) to 0{mn^ log m) while maintaining 
an 0(n)-conditioning. 

Corollary 1 Given a matrix A G with full column rank, it takes at most 0(mn^ log m) 

time to find a matrix R G such that Kp{AR~^) < 2n. 

Unfortunately, even this improvement for computing a 2n-conditioning is not immediately 
applicable to very large matrices. The reason is that such matrices are usually distributively 
stored on secondary storage and each call to the oracle requires a pass through the data. We 
could group n calls together within a single pass, but this would still need C>(nlogm) passes. 
Instead, Meng and Mahoney |62] present a deterministic single-pass conditioning algorithm that 
balances the cost-performance trade-off to provide a 2nl^/^“^l^^-conditioning of A |62] . This 
algorithm essentially invoke the fast ellipsoidal rounding (Lemma [5|) method on a smaller problem 
which is constructed via a single-pass on the original dataset. Their main algorithm is stated in 
Algorithm [H and the main result for Algorithm [1] is the following. 

Leuima 6 (Oue-pass couditiouiug (Meug aud Mahouey [62j)) AlgorithmUlis a 2^\2/p-1\+1_ 
conditioning algorithm, and it runs in 0{{mn^ + n^)logm) time. It needs to compute a 2n- 
rounding on a problem with size rajn by n which needs O(n^logm) calls to the separation oracle 
on the smaller problem. 


Remark 1 Solving the rounding problem of size mjn x n in Algorithmic requires 0{m) RAM, 
which might be too much for very large-scale problems. In such cases, one can increase the 
block size from to, e.g., n^. A modification to the proof of LemmalE shows that this gives 
us a -conditioning algorithm that only needs 0{m/n) RAM and 0((mn-|-n^) log m) 

FLOPS for the rounding problem. 
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Remark 2 One can replace SVD computation in Algorithmic by a fast randomized ^2 subspace 
embedding (i.e., a fast low-rank approximation algorithm as described in UM W and that we 
describe below). This reduces the overall running time to 0{{mn -\- n'^)\og{mn)), and this is 
an improvement in terms of FLOPS; but this would lead to a non-deterministic result with ad¬ 
ditional variability due to the randomization (that in our experience substantially degrades the 
embedding/conditioning quality in practice). How to balance those trade-offs in real applications 
and implementations depends on the underlying application and problem details. 

4.2 Low-distortion subspace embedding and subspace-preserving embedding 

In this subsection, we will describe in detail subspace embedding methods. Subspace embedding 
methods were first used in RandNLA by Drineas et al. in their relative-error approximation 
algorithm for £2 regression (basically, the meta-algorithm described in Section I2.ip |14] : they 
were first used in a data-oblivious manner in RandNLA by Sarlos [28]; and an overview of data- 
oblivious subspace embedding methods as used in RandNLA has been provided by Woodruff m- 
Based on the properties of the subspace embedding methods, we will present them in the following 
four categories. In Section 14.2.11 and 14. 2. 21 we will discuss the data-oblivious subspace embedding 
methods for £2 and £i norms, respectively; and then in Section [4.2.31 and 14.2.41 we will discuss the 
data-aware subspace embedding methods for £2 and £i norms, respectively. Before getting into 
the details of these methods, we first provide some background and dehnitions. 

Let us denote by ^ C the subspace spanned by the columns of A. A subspace embedding 
of A into with s > 0 is a structure-preserving mapping f : A ^ where the meaning 
of “structure-preserving” varies depending on the application. Here, we are interested in low- 
distortion linear embeddings of the normed vector space Ap = {A, || • ||p), the subspace A paired 
with the £p norm || • ||p. (Again, although we are most interested in £i and £ 2 , some of the results 
hold more generally than for just p = 2 and p = 1, and so we formulate some of these results for 
general p.) We start with the following definition. 

Definition 7 (Low-distortion £p snbspace embedding) Given a matrix A G and p G 

[1,00], G is an embedding of Ap if s = 0(poly(n)), independent of m, and there exist 

<7# > 0 and > 0 such that 

0-$ • blip < W^vWp < • blip, Vy G ^p. 

We call 4' a low-distortion subspace embedding of Ap if the distortion of the embedding = 
0(poly(n)), independent ofm. 

We remind the reader that low-distortion subspace embeddings can be used in one of two related 
ways: for £p-noTui preconditioning and/or for solving directly £p regression subproblems. We will 
start by establishing some terminology for their use for preconditioning. 

Given a low-distortion embedding matrix 4* of Ap with distortion let R be the “R” matrix 
from the QR decomposition of 4*^. Then, the matrix AR~^ is well-conditioned in the £p norm. 
To see this, note that we have 

||AR“^x||p < (T$fi;$||4>Aii“^x||p < ■ ||4>Ai?“^||2 • ||x||2 

where the hrst inequality is due to low distortion and the second inequality is due to the equiva¬ 
lence of vector norms. By similar arguments, we can show that 

||AR“^x||p > (T$ • IbAR^^IIp > • ||4 'AR“^x||2 

= • ||x||2, Vx G . 


20 










name 

K, 

running time 

ff passes 

type 

norm 

ER (HUES] 

(n(n -|- 1))^/^ 

0{mn^ log m) 

0{n^L) 

ER 

h 

Fast ER [19] 

2 n 

0{mn^ log m) 

0{nL) 

ER 

h 

Single-pass ER [B2| 

2 n^ 

0{mn^ log m) 

1 

ER 

Q 

CT [63] 

log3/2 

0{mn^ logn) 

1 

QR 

Q 

FCT US] 

log9/2 

0{mn logn) 

1 

QR 

Q 

SPCT [62] 

^(^ 11/2 n) 

0 (nnz(A)) 

1 

QR 

h 

SPCT2 [62] 

6 n 

0 (nnz(A) • log n) 

2 

QR+ER 

h 

RET [M] 

log5/2 

0 (nnz(A)) 

1 

QR 

h 

Gaussian 

0 (1) 

0{mn^) 

1 

QR 

h 

SRHT [6511291 IS] 

0 (1) 

0{mn logm) 

1 

QR 

Q 

cw [SnilMlE^ 

0 (1) 

0 (nnz(A)) 

1 

QR 

Q 


Table 4: Summary of of l\ and £2 norm conditioning methods. QR and ER refer, respectively, to 
methods based on the QR factorization and methods based on Ellipsoid Rounding, as discussed 
in the text. 


Hence, by combining these results, we have 

Kp{AR~^) < = 0(poly(n)), 

he., the matrix AR~^ is well-conditioned in the ip norm. We call a conditioning method that 
is obtained via computing the QR factorization of a low-distortion embedding a QR-type method; 
and we call a conditioning method that is obtained via an ellipsoid rounding of a low-distortion 
embedding an ER-type method. 

Furthermore, one can construct a well-conditioned basis by combining QR-like and ER-like 
methods. To see this, let R be the matrix obtained by applying Corollary [T] to ^A. We have 

< 2na<s>K<s,\\x\\2, Vx G M”, 

where the second inequality is due to the ellipsoidal rounding result, and 

||HR“^x||p > > cj$||x||2, Vx G M"' . 


Hence 

Kp{AR~^) < 2nK$ = 0(poly(n)) 

and AR~^ is well-conditioned. Following our previous conventions, we call this combined type of 
conditioning method a QR-hER-type method. 

In Table m we summarize several different types of conditioning methods for ii and ^2 condi¬ 
tioning. Comparing the QR-type approach and the ER-type approach to obtaining the precon¬ 
ditioner matrix R, we see there are trade-offs between running times and conditioning quality. 
Performing the QR decomposition takes 0{sn^) time, which is faster than fast ellipsoidal rounding 
that takes O(sn^logs) time. However, the latter approach might provide a better conditioning 
quality when 2n < We note that those trade-offs are not important in most theoret¬ 

ical formulations, as long as both take 0(poly(n)) time and provide 0(poly(n)) conditioning, 
independent of m, but they certainly do affect the performance in practice. 

A special family of low-distortion subspace embedding that has very low distortion factor is 
called subspace-preserving embedding. 
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Definition 8 (Subspace-preserving embedding) Given a matrix A G p G [l,oo] and 

e G (0,1), $ G is a subspace-preserving embedding of Ap if s = 0(poly(n)), independent of 

m, and 

(1 - e) • lly|lp < ll%llp < (1 + e) • blip, Vy G . 


4.2.1 Data-oblivious low-distortion £2 subspace embeddings 

An £2 subspace embedding is distinct from but closely related to the embedding provided by the 
Johnson-Lindenstrauss (J-L) lemma. 


Lemma 7 (Johnson-Lindenstrauss lemma |68|) Given e G (0,1), a point set A of N points 
in M”*, there is a linear map cj) : ^ with s = C\ogN/e^, where C > 0 is a global constant, 

such that 


{l-e)\\x-yf < |b(x) < (l + e)||x-y|b Vx,y G A. 


We say a mapping has J-L property if it satisfies the above condition with a constant probability. 


The original proof of the J-L lemma is done by constructing a projection from M"* to a randomly 
chosen s-dimensional subspace. The projection can be represented by a random orthonormal 
matrix in Indyk and Motwani [69] show that a matrix whose entries are independent 

random variables drawn from the standard normal distribution scaled by also satisfies the 

J-L property. This simplifies the construction of a J-L transform, and it has improved algorithmic 
properties. Later, Achlioptas HO] show that the random normal variables can be replaced by 
random signs, and moreover, we can zero out approximately 2/3 of the entries with proper 
scaling, while still maintaining the J-L property. The latter approach allows faster construction 
and projection with less storage, although still at the same order as the random normal projection. 

The original J-L lemma applies to an arbitrary set of N vectors in By using an e- 

net argument and triangle inequality, Sarlos [28| shows that a J-L transform can also preserve 
the Enclidean geometry of an entire n-dimensional subspace of vectors in M™, with embedding 
dimension 0(nlog(n/e)/e^). 

Lemma 8 (Sarlos |28j) Let A 2 he an arbitrary n-dimensional subspace ofMJ^ and 0 < e,(5 < 1. 
//<f> is a J-L transform from to 0{nlog{n/e)/e'^ ■ f{5)) dimensions for some function f. Then 

Pr(Vx G A 2 : Ibib — lbx|| 2 | < e||a:|| 2 ) > 1 — J. 

The result of Lemma [8] applies to any J-L transform, i.e., to any transform (including those with 
better or worse asymptotic FLOPS behavior) that satisfies the J-L distortion property. 

It is important to note, however, that for some J-L transforms, we are able to obtain more re¬ 
fined results. In particular, these can be obtained by bounding the spectral norm of (<I>C/)^(<I>C/) — 
/, where U is an orthonormal basis of .A 2 . If ||($[/)^($[/) — /|| < e, for any x G .42, we have 

|||4>x||i - llxllil = |(17x)^((4>C/)^(4>C/) - L){Ux)\ < e\\Ux\\l = e|k||i, 


and hence 


lll^a^lb - Ikibl < 


ex 


< ex 


2 - 


|<hx||2 -I- ||x||2 

We show some results following this approach. First consider the a random normal matrix, 
which has the following concentration result on its extreme singular values. 
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Lemma 9 (Davidson and Szarek |[7lj) Consider an s x n random matrix G with s > n, 
whose entries are independent random variables following the standard normal distribution. Let 
the singular values be ai > ■ ■ ■ > an- Then for any i > 0, 

max {Pr(cri > y/s + \/n + t), Pr((T„ < y/s — y/n — t)} < (14) 

Using this concentration result, we can easily present a better analysis of random normal projec¬ 
tion than in Lemma [8l 

Corollary 2 Given an n-dimensional subspaee A 2 C and e, 5 G (0,1), let G G be 

a random matrix whose entries are independently drawn from the standard normal distribution. 
There exist s = 0{{^/n + log(l/(5))^/e^) sueh that, with probability at least 1 — 5, we have 

(1 - e)||x ||2 < < (1 + e)lk|| 2 , Vx G ^2 • 

Dense J-L transforms, e.g., a random normal projection and its variants, use matrix-vector 
multiplication for the embedding. Given a matrix A G computing A = ^A takes 0(nnz(^)- 

s) time when <1> is a dense matrix of size s x m and nnz(^) is the number of non-zero elements 
in A. There is also a line of research work on “fast” J-L transforms that started with [TllEs]. 
These use FFT-like algorithms for the embedding, and thus they lead to O(mlogm) time for each 
projection. Hence, computing A = ^A takes 0(mn log m) time when is a fast J-L transform. 
Before stating these results, we borrow the notion of FJLT from |72l I73j and use that to define a 
stronger and faster version of the simple J-L transform. 

Definition 9 (FJLT) Given an n-dimensional subspace A 2 C we say G is an 

F,JLT for A 2 if 4* satisfies the following two properties: 

• \\{^U)'^ {^U) — In \\2 < e, where U is an orthonormal basis of A 2 . 

• Given any x G M”", 4>x can be computed in at most O(mlogm) time. 

Ailon and Chazelle construct the so-called fast Johnson-Lindenstrauss transform (FJLT) |73] . 
which is a product of three matrices = PHD, where P G is a sparse J-L transform 

with approximately 0(slog^ A^) nonzeros, H G is a normalized Walsh-Hadamard matrix, 

and D G is a diagonal matrix with its diagonals drawn independently from { — 1,1} with 

probability 1/2. Because multiplying H with a vector can be done in O(mlogm) time using 
an FFT-like algorithm, it reduces the projection time from 0{sm) to 0{mlogm). This FJLT 
construction is further simplified by Ailon and Liberty [HIES]. 

A subsequently-refined FJLT was analyzed by Tropp [65], and it is named the subsampled 
randomized Hadamard transform (SRHT). As with other FJLT methods, the SRHT preserves 
the geometry of an entire £2 subspace of vectors by using a matrix Chernoff inequality to bound 
II(<hl7)^(<h[/) — /II 2 . We describe this particular FJLT in more detail. 

Definition 10 An SRHT is an s x m matrix of the form 

$ = ^RHD, 

where 


• D G is a diagonal matrix whose entries are independent random signs, 

• 77 G is a Walsh-Hadamard matrix scaled by , 

• 7? G restricts an n-dimensional vector to s coordinates, chosen uniformly at random. 


23 


Below we present the main results for SRHT from m since it has a better characterization of the 
subspace-preserving properties. We note that its proof is essentially a combination of the results 
in [63129]. 

Lemma 10 (SRHT |65l 1291115| l Given an n-dimensional subspace A 2 C M™' and e, 5 E (0,1), 
let <1> € be a random SRHT with embedding dimension s > ^ 30 ^in(40mn) ^ ^ 

Then, with probability at least 0.9, we have 

(1 - e)||x||2 < ll^^lb < (1 + e)||x||2, Vx E .A2 . 

Note that besides Walsh-Hardamard transform, other FFT-based transform, e.g., discrete Hartley 
transform (DHT), discrete cosine transform (DCT) which have more practical advantages can be 
also be used; see m for an details of other choices. Another important point to keep in mind 
(in particular, for parallel and distributed applications) is that, although called “fast,” a fast 
transform might be slower than a dense transform: when nnz(A) = 0{m) (since machines are 
optimized for matrix-vector multiplies); when A’s columns are distributively stored (since this 
slows down FFT-like algorithms, due to communication issues); or for other machine-related 
issues. 

More recently, Clarkson and Woodruff |2D] developed an algorithm for the I 2 subspace em¬ 
bedding that runs in so-called input-sparsity time, i.e., in 0(nnz(A)) time, plus lower-order terms 
that depend polynomially on the low dimension of the input. Their construction is exactly the 
CountSketch matrix in the data stream literature m. which is an extremely simple and sparse 
matrix. It can be written as the product of two matrices = SD E where S E 

has each column chosen independently and uniformly from the s standard basis vectors of M* 
and D E is a diagonal matrix with diagonal entries chosen independently and uniformly 

from ±1. By decoupling A into two orthogonal subspaces, called “heavy” and “light” based on 
the row norms of U, an orthonormal basis of A, i.e., based on the statistical leverage scores of 
A, they proved that with an embedding dimension 0{n^ je^), the above construction gives an £2 
subspace embedding matrix. Improved bounds and simpler proofs (that have much more linear 
algebraic flavor) were subsequently provided by Mahoney and Meng |66] and Nelson and Nguyen 
[67j . In rest of this paper, we refer to this method as CW. Below, we present the main results 
from [201IM1[67|. 


Lemma 11 (Input-sparsity time embedding for ^2 [20|, I66|, 167] ) Given an n-dimensional 
subspace A 2 C and any 6 E (0,1), let s = (n^-|-n)/(e^(5). Then, with probability at least 1 — 5, 

(1 - e)||x||2 < ||$a;||2 < (1 + e)||2:||2, Vx E A 2 , 
where E is the GountSketch matrix described above. 

Remark 3 It is easy to see that computing ^A, i.e., computing the subspace embedding, takes 
C)(nnz(A)) time. The C)(nnz(A)) running time is indeed optimal, up to constant factors, for 
general inputs. Gonsider the case when A has an important row Oj such that A becomes rank- 
deficient without it. Thus, we have to observe o, in order to compute a low-distortion embedding. 
However, without any prior knowledge, we have to scan at least a constant portion of the input to 
guarantee that ai is observed with a constant probability, which takes 0(nnz(A)) time. Also note 
that this optimality result applies to general ip norms. 

To summarize, in Table (3 we provide a summary of the basic properties of several data- 
oblivious £2 subspace embeddings discussed here (as well as of several data-aware £2 subspace¬ 
preserving embeddings that will be discussed in Section l4.2.3p . 
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name 

running time 

s 


Gaussian (REF) 

0{mns) 


1 + e 

SRHT [Mlisa[IS] 

0{mn logm) 

0 (nlog(mn) log(n/e^)/e 

2 ) 1 + e 

CW |20l|66l EZj 

0 (nnz(A)) 

(n^ -|- n)/e^ 

1 + e 

Exact lev. scores sampling [14] 

0{mn^) 

0{n logn/e^) 

1 + e 

Appr. lev. scores sampling (SRHT) [15] 

0{mn logm) 

0{n\ogn/e^) 

1 + e 

Appr. lev. scores sampling (CW) [20l [15] 

0 (nnz(A)) log m 

0{n\ogn/e^) 

1 + e 


Table 5: Summary of data-oblivious and data-aware £2 embeddings. Above, s denotes the embed¬ 
ding dimension. By running time, we mean the time needed to compute ^A. For each method, 
we set the failure rate to be a constant. Moreover, “Exact lev. scores sampling” means sampling 
algorithm based on using the exact leverage scores (as importance sampling probabilities); and 
“Appr. lev. scores sampling (SRHT)” and “Appr. lev. scores sampling (CW)” are sampling 
algorithms based on approximate leverage scores estimated by using SRHT and CW (using the 
algorithm of m) as the underlying random projections, respectively. Note that within the algo¬ 
rithm (of |15j ) for approximating the leverage scores, the target approximation accuracy is set to 
be a constant. 


name 

running time 

s 


CT [63] 

0 (mn^ logn) 

O(nlogn) 

0{n log n) 

FCT [19] 

0{mn logn) 

0{nlogn) 

0{n^ log^ n) 

SPCT [66] 

nnz(A) 

0{n^ log® n) 

0{n^ log® n) 

Reciprocal Exponential [64] 

nnz(A) 

0{nlogn) 

0{n^ log^ n) 

Sampling (FCT) [THl 177] 

0{mn logn) 

iog9/2 7),log(l/e)/e^) 

1 + e 

Sampling (SPCT) [66l IT^ 177] 

0(nnz{A) ■ log n) 

Q^n^5/2 logi^/2 7T,log(l/e)/e2) 

1 + e 

Sampling (RET) [BTl [77] 

0(nnz{A) ■ log n) 

log®/^ nlog(l/e)/e^) 

1 + e 


Table 6: Summary of data-oblivious and data-aware ii embeddings. Above, s denotes the embed¬ 
ding dimension. By running time, we mean the time needed to compute HA. For each method, 
we set the failure rate to be a constant. Moreover, “Sampling (FCT)”, “Sampling (SPOT)” and 
“Sampling (RET)” denote the ii sampling algorithms obtained by using ECT, SPOT and RET 
as the underlying preconditioning methods, respectively. 


Remark 4 With these low-distortion I 2 subspace embeddings, one can use the QR-type method to 
compute an £2 preconditioner. That is, one can compute the QR factorization of the low-distortion 
subspace embeddings in Table\^and use R~^ as the preeonditioner; see Table\^for more details. 
We note that the tradeoffs in running time are implicit although they have the same conditioning 
quality. This is because the running time for computing the QR factorization depends on the 
embedding dimension which is varied from method to method. However, normally this is absorbed 
by the time for forming ^A (theoretieally, and it is in practice not the dominant effect). 


4.2.2 Data-oblivious low-distortion ii subspace embeddings 

General £p subspace embedding and even £i subspace embedding is quite different from £2 subspace 
embedding. Here, we briefly introduce some existing results on £i subspace embedding; for more 
general £p subspace embedding, Meng and Mahoney [66] and Clarkson and Woodruff [20] . 

For £ 1 , the first question to ask is whether there exists an J-L transform equivalent. This 
question was answered in the negative by Charikar and Sahai m- 
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Lemma 12 (Charikar and Sahai [TSj) There exists a set of 0{m) points in Pf' such that any 
linear embedding into i\ has distortion at least y/rn/s. The trade-off between dimension and 
distortion for linear embeddings is tight up to a logarithmic factor. There exists a linear embedding 
of any set of N points in Pf to £f where s' = 0{slogN) and the distortion is 0{^m/s). 

This result shows that linear embeddings are particularly “bad” in p , compared to the particularly 
“good” results provided by the J-L lemma for p- To obtain a constant distortion, we need s > Cm 
for some constant C. So the embedding dimension cannot be independent of m. However, the 
negative result is obtained by considering arbitrary point sets. In many applications, we are 
dealing with structured point sets, e.g., vectors from a low-dimensional subspace. In this case, 
Sohler and Woodruff [63] give the first linear oblivious embedding of a n-dimensional subspace 
of Pf into with distortion O(nlogn), where both the embedding dimension and the 

distortion are independent of m. In particular, they prove the following quality bounds. 

Lemma 13 (Cauchy transform (CT), Sohler and Woodruff [63j) Let Ai be an arbitrary 
n-dimensional linear subspace o/M™. Then there is an sq = so(n) = O(nlogn) and a sufficiently 
large constant Cq > 0, such that for any s with sq < s < and any constant C > Cq, if 

‘h G is a random matrix whose entries are choose independently from the standard Cauchy 

distribution and are scaled by C/s, then with probability at least 0.99, 

Iklli < ll^hxlli < 0{n\ogn) ■ ||x||i, Vx G . 

The proof is by constructing tail inequalities for the sum of half Cauchy random variables |63j . 
The construction here is quite similar to the construction of the dense Gaussian embedding for p 
in Lemma[2l with several important differences. The most important differences are the following: 

• Cauchy random variables replace standard normal random variables; 

• a larger embedding dimension does not always lead to better distortion quality; and 

• the failure rate becomes harder to control. 

As CT is the p counterpart of the dense Gaussian transform, the Fast Cauchy Transform 
(FCT) proposed by Clarkson et al. |19j is the p counterpart of FJLT. There are several re¬ 
lated constructions. For example, this FCT construction first preprocesses by a deterministic 
low-coherence matrix, then rescales by Cauchy random variables, and finally samples linear com¬ 
binations of the rows. Then, they construct <I> as 

$ = dHCH, 


where: 

• H G has each column chosen independently and uniformly from the s standard basis 

vectors for R^; for a sufficiently large, the parameter is set as s = anlog(n/(5), where 
b G (0,1) controls the probability that the algorithm fails; 

• C G ]R2mx2m -g ^ diagonal matrix with diagonal entries chosen independently from a Cauchy 
distribution; and 

• H £ ]^ 2 mxm -g ^ block-diagonal matrix comprised of m/t blocks along the diagonal. Each 

block is the 2t x t matrix Cs = where f is the t x t identity matrix, and Ht is 
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the normalized Hadamard matrix. (For simplicity, assume t is a power of two and m/t is 
an integer.) 

(Gs \ 



\ GsJ 

Informally, the effect of H in the above FCT construction is to spread the weight of a vector, 
so that Hy has many entries that are not too small. This means that the vector CHy comprises 
Cauchy random variables with scale factors that are not too small; and finally these variables are 
summed up by B, yielding a vector BCHy, whose £i norm won’t be too small relative to || 2 /||i. 
They prove the following quality bounds. 

Lemma 14 (Fast Cauchy Transform (FCT), Clarkson et al. |19j) There is a distribution 
(given hy the above construction) over matrices G with s = 0(nlogn + nlog(l/(5)), such 

that for an arbitrary (but fixed) A G and for all x G R”, the inequalities 

hold with probability 1 — 5, where 

K = O log(sn) 

Further, for any y G R™', the product ^y can he computed in O(mlogs) time. 

To make the algorithm work with high probability, one has to set t to be at the order of s® and 
s = O(nlogn). It follows that n = O(n^log^n) in the above theorem. That is, while faster 
in terms of FLOPS than the CT, the FCT leads to worse embedding/preconditioning quality. 
Importantly, this result is different from how FJLT compares to dense Gaussian transform: FJLT 
is faster than the dense Gaussian transform, while both provide the same order of distortion; but 
FCT becomes faster than the dense Cauchy transform, at the cost of somewhat worse distortion 
quality. 

Similar to [201 ESI ET] for computing an £2 subspace embedding, Meng and Mahoney [66] 
developed an algorithm for computing an £i subspace embedding matrix in input-sparsity time, 
he., in 0(nnz(^)) time. They used a CountSketch-like matrix which can be written as the product 
of two matrices = SC G R®^™, where S G R^^™' has each column chosen independently and 
uniformly from the s standard basis vectors of R'^ and C G is a diagonal matrix with 

diagonal entries chosen independently from the standard Cauchy distribution. We summarize the 
main theoretical results in the following lemma. 

Lemma 15 (Sparse Cauchy Transform (SPCT), Meng and Mahoney |66|) Given an n- 
dimensional subspace Ai C R™ and e G (0,1), there is s = 0(nfi log® n) such that with a constant 
probability, 

1/ 0{n^ log^ ^)lk||i < ll^a^lli < C>(nlogn)||x||i, Vx G Ai, 
where <I> is the sparse Cauchy transform described above. 

More recently. Woodruff and Zhang |64j proposed another algorithm that computes an £\ 
subspace embedding matrix in input-sparsity time. Its construction is similar to that of sparse 
Cauchy transform. That is, = SB where D is a diagonal matrix with diagonal entries 
l/ui,l/u 2 ,... A/un where ui are exponential variables. Comparing to sparse Cauchy trans¬ 
form, the embedding dimension and embedding quality have been improved. We summarize the 
main results in the following lemma. 
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Lemma 16 (Woodruff and Zhang [64]) Given an n-dimensional subspace Ai C M™' and e G 
(0,1), there is s = O(nlogn) such that with a constant probability, 

1/C>(nlogn)||x||i < ||4>x||i < C>(nlogn)||x||i, \/x G 
where <I> is the sparse transform using reciprocal exponential variables described above. 

To summarize, in Table (6] we provide a summary of the basic properties of several data- 
oblivious subspace embeddings discussed here (as well as of several data-aware li subspace¬ 
preserving embeddings that will be discussed in Section [4.2.4p . 


4.2.3 Data-aware low-distortion (.2 subspace embeddings 

All of the linear subspace embedding algorithms mentioned in previous subsections are oblivi¬ 
ous, i.e., independent of the input subspace. That has obvious algorithmic advantages, e.g., one 
can construct the embedding matrix without even looking at the data. Since using an oblivious 
embedding is not a hard requirement for the downstream task of solving ^p regression problems 
(and since one can use random projection embeddings to construct importance sampling probabil¬ 
ities m in essentially “random projection time,” up to small constant factors), a natural question 
is whether non-oblivious or data-aware embeddings could give better conditioning performance. 

In general, the answer is yes. 

As mentioned in Section 12.21 Drineas et al. m developed a sampling algorithm for solving 
£2 regression by constructing a (1 ± e)-distortion £2 subspace-preserving sampling matrix. The 
underlying sampling distribution is defined based on the statistical leverage scores of the design 
matrix which can be viewed as the “influence” of that row on the LS fit. That is, the sampling 
distribution is a distribution {pi}^i satisfying 

i = l,...,m. (15) 

Above {£i}^^ are the leverage scores of A and (3 G (0,1]. When /3 = 1 and ,0 < 1, (fTS]) implies 
we define {pi}™ 1 according to the exact and estimated leverage scores, respectively. 

More importantly, theoretical results indicate that, given a target desired accuracy, the re¬ 
quired sampling complexity is independent of the higher dimension of the matrix. Similar con¬ 
struction of the sampling matrix appeared in several subsequent works, e.g., naEnKis], with 
improved analysis of the sampling complexity. For completeness, we include the the main theo¬ 
retical result regarding the subspace-preserving quality below, stated here for £ 2 . 

Theorem 1 {£2 subspace-preserving sampling |1411291115j l Given an n-dimensional sub- 
space A 2 C M"* represented by amatrixA G ande G (0,1), chooses = 0(nlognlog(l/<5)//3e^), 

and construct a sampling matrix S G with diagonals 


— 



with probability qi, 
otherwise, 


i = 1,... ,m. 


where 


gi > min{l,s-pi} , i = l,...,m. 


and {pi}™ 1 satisfies (fT^ . Then, with probability at least 0.7, 


(1 - e)\\y \\2 < ||5’y||2 < (1 + e)||?/|| 2 , Vy G M 2 . 
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An obvious (but surmountable) challenge to applying this result is that computing the leverage 
scores exactly involves forming an orthonormal basis for A first. Normally, this step will take 
0{m'n?) time which becomes undesirable when for large-scale applications. 

On the other hand, by using the algorithm of m, computing the leverage scores approximately 
can be done in essentially the time it takes to perform a random projection: in particular, Drineas 
et al. |15] suggested that one can estimate the leverage scores by replacing A with a “similar” 
matrix in the computation of the pseudo-inverse (which is the main computational bottleneck in 
the exact computation of the leverage scores). To be more specific, by noticing that the leverage 
scores can be expressed as the row norms of AA\ we can use £2 subspace embeddings to estimate 
them. The high-level idea is, 

||eiAAt ||2 « ||eiA(niA)t ||2 « ||eiA(niA)tn 2 || 2 , 

where is a vector with zeros but 1 in the f-th coordinate, IIi G is a FJLT and 112 £ 

is a JLT which preserve the £2 norms of certain set of points. If the estimation of the leverage 

scores £i satisfies 

{'^-l)£i<£i<i} + l)£i, i = 

then it is not hard to show that a sampling distribution {pi}™! defined according to pi = 

satisfies (IlSp with /? = When 7 is constant, say 0.5, from Theorem [H the required sampling 
complexity will only need to be increased by a constant factor 1//3 = 3. This is less expensive, 
compared to the gain in the computation cost. 

Suppose, now, we use SRHT iLemmafTOl) or CW iLemmafTTI) method as the underlying FJLT, 
i.e., Hi, in the approximation of the leverage scores. Then, combining the theory suggested in m 
and Theorem [H we have the following lemma. 

Lemma 17 (Fast £2 subspace-preserving sampling (SRHT) [15] ) Given an n-dimensional 
subspace A 2 C represented by a matrix A G ^ £ (0,1), it takes 0{mnlogm) 

time to compute a sampling matrix S' G (with only one nonzero element per row) with 

s' = 0 (nlogn/e^) such that with constant probability 

(1 - e)\\y\\2 < Il'S'ylb < (1 + e)\\y\\ 2 , Vy G ^2 ■ 

Lemma 18 (Fast £2 subspace-preserving sampling fCWl [T5l I20| f Given an n-dimensional 
subspace A 2 C represented by a matrix A G and e G (0,1), it takes 0(nnz(A) • logm) 

time to compute a sampling matrix S £ M" (with only one nonzero element per row) with 
s' = C>(n log n/e^) such that with constant probability 

(1 - e)\\y \\2 < ||S’y||2 < (1 + e)||?/||2, Vy G ^2 ■ 

Remark 5 Although using GW runs asymptotically faster than using SRHT, due to the poorer 
embedding quality of CW, in order to achieve the same embedding quality, and relatedly the same 
quality results in applications to £2 regression, it may need a higher embedding dimension, i.e., 
ri. This results in a substantially longer QR factorization time for CW-based methods. 

Finally, recall that a summary of both data-oblivious and data-aware subspace embedding for 
£2 norm can be found in Table El 
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4.2.4 Data-aware low-distortion subspace embeddings 

In the same way as we can use data-aware embeddings for I 2 regression, we can also use data- 
aware embeddings for regression. Indeed, the idea of using data-aware sampling to obtain 
(1 zb e)-distortion subspace embeddings for l\ regression was first used in [21], where it was 
shown that an subspace embedding can be done by weighted sampling after preprocessing the 
matrix, including preconditioning, using ellipsoidal rounding. Sampling probabilities depend on 
the norms of the rows of the preconditioned matrix. Moreover, the resulting sample has each 
coordinate weighted by the reciprocal of its sampling probability. Different from oblivious 
subspace embeddings, the sampling approach can achieve a much better distortion. 

Lemma 19 (Clarkson |51j) Given an n-dimensional subspaee Ai C M™' represented by a ma¬ 
trix A E e, 5 E (0,1), it takes 0{mn^logm) time to compute a sampling matrix 

5 E (with only one nonzero element per row) with s' = 0{rfi'^\og{n/{5e))/e^) sueh that, 

with probability at least 1 — 5, 

(1 - e)||2/||i < 11*5^111 < (1 + e)||?/||i, Vy E . 

Therefore, to estimate the £i norms of any vector from a n-dimensional subspace of M™, we only 
need to compute the weighted sum of the absolute values of a few coordinates of this vector. 

Recall that the £2 leverage scores used in the £2 sampling algorithm described in Theorem [1] 
are the squared row norms of a orthonormal basis of A 2 which can be a viewed as a “nice” basis 
for the subspace of interest. Dasgupta et al. 135 ] generalized this method to the general £p case; in 
particular, they proposed to sample rows according to the £p row norms of AR~^, where AR~^ is 
a well-conditioned (in the £p sense of well-conditioning) basis for Ap. Different from £i sampling 
algorithm m described above, computing such matrix R is usually sufficient, meaning it is not 
needed to preprocess A and form the basis AR~^ explicitly. 


Theorem 2 {£p subspace-preserving sampling, Dasgnpta et al. [35]) Given an n-dimensional 
subspace Ap C MJ" represented by a matrix A E and a matrix R E such that AR~^ 

is well-conditioned, p E [l,oo), e E (0,1/7), and 5 E (0,1), choose 


s > IQ{2P + 2)KFp{AR ^)(nlog(12/e) log(2/5))/(/e^), 


and construct a sampling matrix S E R™^”^ with diagonals 


— 



with probability pi, 
otherwise, 


l,...,m. 


where 

Pi > min{l,s • \\aiR~'^\fp/\AR~^\P^] , i = l,...,m. 
Then, with probability at least 1 — 5, 


(1 - e)||y||p < ||5'y||p < (1 e)||y||p, Vy E .4p . 

In fact. Theorem [2] holds for any choice of R. When ii = /, it implies sampling according 
to the £p row norms of A and the sampling complexity replies on Rp{A). However, it is worth 
mentioning that a large condition number for A will leads to a large sampling size, which in turn 
affects the running time of the subsequent operations. Therefore, preconditioning is typically 
necessary. That is, one must find a matrix R E R”'^"' such that Rp{AR~^) = 0(poly(n)), which 
could be done by the preconditioning algorithms introduced in the previous sections. 
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Given R such that AR~^ is well-conditioned, computing the row norms of AR~^ takes 
0{myz{A) ■ n) time. Clarkson et al. [19] improve this running time by estimating the row norms 
of AR~^ instead of computing them exactly. The central idea is to post-multiply a random 
projection matrix 112 £ with r = 0{\ogm) which takes only 0{mi7,{A) • logm) time. 

If one uses FCT or SPOT in Table [D to compute a matrix R such that AR~^ is well-conditioned 
and then uses the above idea to estimate quickly the i\ row norms of AR~^ to define the sampling 
distribution, then by combining with Theorem [21 we have the following two results. 

Lemma 20 (Fast subspace-preserving sampling (FCT) |19l 177] 1 Given an n-dimensional 
subspaee Ai C M™' represented by a matrix A G g £ (0,1), it takes 0{mnlogm) 

time to compute a sampling matrix S' G (with only one nonzero element per row) with 

s' = 0{n^ log2 n log(l/e)/e^) such that with a constant probability, 

(1 - e)||x||i < ||S'x||i < (1 -I- e)||a:||i, Vx G . 

Lemma 21 (Fast I'l subspace-preserving sampling (SPCT) [66LI19L IT7j l Given an n-dimensional 
subspace Ai C represented by a matrix A G and e G (0,1), it takes 0(nnz(^) • logm) 

time to compute a sampling matrix S' G (with only one nonzero element per row) with 

s' = 0{n^ log "a" relog(l/e)/e^) such that with a constant probability, 

(1 — e)||x||i < ||S'x||i < (1 -I- e)||a:||i, Vx G . 

Remark 6 Fast sampling algorithm also exists for ip regression. That is, after computing a 
matrix R such that AR~^ is well-conditioned, one can use a similar idea to approximate the I 2 
row norms of AR~^, e.g., post-multiplying a random matrix with independent Gaussian variables 
(JLT), which lead to estimation of the ip row norms of AR~^ up to small factors; see JW) for 
more details. 

Remark 7 We note that the speed-up comes at the cost of increased sampling complexity, which 
does not substantially affect most theoretical formulations, since the sampling complexity is still 
0(poly(n) log(l/e)/e^). In practice, however, it might be worth computing U = AR~^ and its 
row norms explicitly to obtain a smaller sample size. One should be aware of this trade-off when 
implementing a subspace-preserving sampling algorithm. 

Finally, recall that a summary of both data-oblivious and data-aware subspace embeddings 
for li norm can be found in Table [H 

4.3 Application of rounding/embedding methods to £1 and £2 regression 

In this subsection, we will describe how the ellipsoidal rounding and subspace embedding methods 
described in the previous subsections can be applied to solve £2 and regression problems. In 
particular, by combining the tools we have introduced in the previous two subsections, e.g., solving 
subproblems and constructing preconditioners with ellipsoid rounding and subspace-embedding 
methods, we are able to describe several approaches to compute very fine (1 -|- e) relative-error 
solutions to Ip regression problems. 

Depending on the downstream task of interest, e.g., how the solution to the regression problem 
will be used, one might be interested in obtaining low-precision solutions, e.g., e = 10“^, medium- 
precision solutions, e.g., e = 10“^, or high-precision solutions, e.g., e = 10“^°. As described in 
Section [21 the design principles for these cases are somewhat different. In particular, the use of I 2 
and Ii well-conditioned bases is somewhat different, depending on whether or not one is interested 
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type 

precision 

example 

reference 

embedding -|- solving subproblem 

direct solver 

low 

high 

CW + (FJLT+SVD) 
appr. lev. samp. (SRHT) -|- SVD 
SVD or QR 

m 

m 

m 

PC -|- direct solver 

high 

PC (Gaussian) -|- normal equation 

m 

PC -|- iterative alg. 

high 

PC (FJLT) + LSQR 

[SB ED] 


Table 7: Summary of RandNLA-based regression solvers; PC stands for preconditioning. 


type 

precision 

example 

reference 

(PC -|- sampling) -|- solving subproblem 

second-order 

low 

high 

(ER/fast ER -|- sampling) -|- IPM 
(SCT/FCT + sampling) + IPM 
IPM 

[511 [35] 
|63l[l9] 

m 

PC -|- first-order 

high 

ER -|- accelerated gradient descent 

m 


Table 8: Summary of RandNLA-based l\ regression solvers; PC stands for preconditioning. 


in low precision. Here, we elaborate on how we can use the methods described previously construct 
low-precision solvers and high-precision solvers for solving Ip regression problems. As a reference, 
see Table [7] and Table [8] for a summary of several representative RandNLA algorithms for solving 
£2 and regression problems, respectively. (Most of these have been previously introduced for 
smaller-scale computations in RAM; and in Section [5] we will describe several variants that extend 
to larger-scale parallel and distributed environments.) 

4.3.1 Low-precision solvers 

The most straightforward use of these methods (and the one to which most of the theory has been 
developed) is to construct a subspace-preserving embedding matrix and then solve the resulting 
reduced-sized problem exactly, thereby obtaining an approximate solution to the original problem. 
In somewhat more detail, this algorithmic approach performs the following two steps. 

1. Construct a subspace-preserving embedding matrix H with distortion 1 ± |. 

2. Using a black-box solver, solve the reduced-sized problem exactly, i.e., exactly solve 

X = min \\YiAx — H^IL. 

(We refer to this approach as low-precision since the running time complexity with respect to 
the error parameter e is poly(l/e). Thus, while this approach can be analyzed for a fixed e, this 
dependence means that as a practical matter this approach cannot achieve high-precision solu¬ 
tions.) 

To see why this approach gives us a (1 -|- e)-approximate solution to the original problem, 
recall that a subspace-preserving embedding matrix H with distortion factor (1 ± |) satishes 

(1 — e/4) • ||Ax||p < ||nAx||p < (1 + e/4) • ||Ax||p, Vx G M"". 

Therefore, the following simple reasoning shows that x is indeed a (1-|-e)-approximation solution. 

For completeness, we include the following lemma stating this result more precisely. 
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Lemma 22 Given an ip regression problem specified by A ^ j^mxn ^ ^ [l,oo) using the 
constrained formulation Q, let ^ be a {l±e/A)-distortion embedding of Ap, and x be an optimal 
solution to the reduced-sized problem ||<I>74x||p. Then x is a {1 -\- e)-approximate solution 

to the original problem. 

A great deal of work has followed this general approach. In particular, the meta-algorithm 
for £2 regression from Section [2] is of this general form. Many other authors have proposed 
related algorithms that require solving the subproblem by first computing a subspace-preserving 
sampling matrix. See, e.g., [T] and references therein. Here, we simply cite several of the most 
immediately-relevant for our subsequent discussion. 

• Sampling for £2 regression. One could use the original algorithm of 0117 ], which performs 
a data-aware random sampling and solves the subproblem in 0{mn^) time to obtain an 
approximate solution. Using the algorithm of m, the running time of this method was 
improved to roughly 0{mnlog{n)) time, and by combining the algorithm of [TB] with the 
algorithm of [20|, the running time was still further improved to input-sparsity time. 

• Projections for £2 regression. Alternatively, one could use the algorithm of [281129j . which 
performs a data-oblivious Hadamard-based random projection and solves the subproblem 
in roughly 0{mnlog{n)) time, or one could use the algorithm of [20], which runs in input- 
sparsity time. 

• Sampling and projections for £i and £p regression. See (HU |63l [19] and see [35l [66l [20] and 
references therein for both data-oblivious and data-aware methods. 

To summarize these and other results, depending on whether the idealization that m 3> n holds, 
either the Hadamard-based projections for £2 regression {e.g., the projection algorithm of [29] 
or the sampling algorithm of m combined with the algorithm of m) and £i regression {e.g., 
the algorithm of [19] ) or the input-sparsity time algorithms for £2 and £i regression {e.g., the 
algorithms of [20] and [66]) lead to the best worst-case asymptotic performance. There are, 
however, practical tradeoffs, both in RAM and in parallel-distributed environments, and the 
most appropriate method to use in any particular situation is still a matter of ongoing research. 

4.3.2 High-precision solvers 

A more rehned use of these methods (and the one that has been used most in implementations) 
is to construct a subspace-preserving embedding matrix and then use that to construct a pre¬ 
conditioner for the original £p regression problem, thereby obtaining an approximate solution to 
the original problem. In somewhat more detail, this algorithmic approach performs the following 
two steps. 

1. Construct a randomized preconditioner for A, called N. 

2. Invoke an iterative algorithm whose convergence rate depends on the condition number of 
the problem being solved (a linear system for £2 regression, and a linear or convex program 
for £i regression) on the preconditioned system AN. 

(We refer to this approach as high-precision since the running time complexity with respect to 
the error parameter e is log(l/e). Among other things, this means that, given a moderately 
good solution— e.g., the one obtained from the embedding that could be used in a low-precision 
solver—one can very easily obtain a very high precision solution.) 

Most of the work for high-precision RandNLA solvers for £p regression has been for £2 regres¬ 
sion (although we mention a few solvers for £i regression for completeness and comparison). 
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• For ^2 regression. Recall that theoretical (and empirical) results suggest that the required 
number of iterations in many iterative solvers such as LSQR [l3] depends strongly on the 
condition number of the system. Thus, a natural idea is first to compute a randomized pre¬ 
conditioner and then to apply one of these iterative solvers on the preconditioned system. 
For example, if we use SRHT (Lemma fTIH) to create a preconditioned system with condi¬ 
tion number bounded by a small constant and then use LSQR to solve the preconditioned 
problem iteratively, the total running time would be 0{mn\og{m/e) + n^logn), where 
0(mnlog(m)) comes from SRHT, 0{rfi\ogn) from computing the preconditioner matrix, 
and 0{mn\og{l/e)) from LSQR iterations. Authors in [311 ED] developed algorithms that 
use FJLT for preconditioning and LSQR as an iterative solver. In [32|, the authors de¬ 
veloped a randomized solver for (.2 regression using Gaussian transform and LSQR or the 
Chebyshev semi-iterative method; see Section 15.11 for more details. 

As with the low-precision solvers, note that if we use the input-sparsity time algorithm of [20| 
for embedding and then use an (SRHT -|- LSQR) approach above to solve the reduced¬ 
sized problem, then under the assumption that m > poly(n) and e is fixed, this particular 
combination would become the best approach proposed. However, there are various trade¬ 
offs among those approaches. For instance, there are trade-offs between running time and 
conditioning quality in preconditioning for computing the subspace-preserving sampling 
matrix, and there are trade-offs between embedding dimension/sample size and failure rate 
in embedding/sampling. Some of the practical trade-offs on different problem types and 
computing platforms will be discussed in Section 15.31 below. 

• For regression. While most of the work in RandNLA for high-precision solvers has been for 
£2 regression, we should point out related work for (.1 regression. In particular, Nesterov [5Tj 
proposed an algorithm that employs a combination of ellipsoid rounding and accelerated 
gradient descent; and second-order methods from m use interior point techniques more 
generally. See also the related solvers of Portnoy et al. [80] [H]. For regression, Meng and 
Mahoney [62] coupled these ideas with RandNLA ideas to develop an iterative medium- 
precision algorithm for £1 regression; see Section [5.21 for more details. 


5 Implementations and empirical results 

In this section, we describe several implementations in large-scale computational environments 
of the theory described in Section |4l In particular, in Section 15.11 we will describe LSRN, an (.2 
regression solver appropriate for parallel environments using multi-threads and MPI; and then 
in Section 15.21 we will describe the results of both a low-precision algorithm as well as a related 
medium-precision iterative algorithm for the £1 regression problem. Both of these subsections 
summarize recent previous work, and they both illustrate how implementing RandNLA algo¬ 
rithms in parallel and distributed environments requires paying careful attention to computation- 
communication tradeoffs. These prior results do not, however, provide a comprehensive evaluation 
of any particular RandNLA method. Thus, for completeness, we also describe in Section [5.31 sev- 
eral new results: a comprehensive empirical evaluation of low-preeision, medium-precision, and 
high-precision random sampling and random projection algorithms for the very overdetermined £2 
regression problem. Hereby, by “medium-precision”, typically we mean calling a high-precision 
solver but executing less iterations in the underlying iterative solver. These implementations 
were done in Apache SoarlJ^: they have been applied to matrices of up to terabyte size; and they 

1C 

http://spark.apache.org 
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illustrate several points that will be important to understand as other RandNLA algorithms are 
implemented in very large-scale computational environments. 

5.1 Solving (.2 regression in parallel environments 

In this subsection, we describe implementation details for a high-precision £2 regression solver 
designed for large-scale parallel environments. LSRN [32] is designed to solve the minimum-length 
least squares problem ([T|) to high precision; and it works for linear systems that are either strongly 
over-determined, he., m 3> n or strongly under-determined, he., m <C n, and possibly rank- 
deficient. LSRN uses random normal projections to compute a preconditioner matrix such that 
the preconditioned system is provably extremely well-conditioned. In particular, either LSQR [l3| 
(a conjugate gradient based method) or the Chebyshev semi-iterative (CS) method [IS] can be 
used at the iterative step to compute the min-length solution within just a few iterations. As we 
will describe, the latter method is preferred on clusters with high communication cost. Here, we 
only present the formal description of the Algorithm LSRN for strongly over-determined systems 
in Algorithm [2j 


Algorithm 2 LSRN for strongly over-determined systems 
1: Choose an oversampling factor 7 > 1, e.g., 7 = 2. Set s = \'yn]. 

2: Generate G = randn(s,m), a Gaussian matrix. 

3 : Compute A = GA. 

4 : Compute A’s economy-sized SVD: . 

5 : Let N = (Note: that this is basically R~^ from QR on the embedding, but it is 

written here ito the SVD.) 

6: Iteratively compute the min-length solution y to 

minimizej^gRr 11 ANy - 6112. 

7 : Return x = Ny. 


Two important aspects of LSRN are the use of the Gaussian transform and the CS method, and 
they are coupled in a nontrivial way. In the remainder of this subsection, we discuss these issues. 

To start, note that, among the available choices for the random projection matrix, the Gaus¬ 
sian transform has particularly-good conditioning properties. In particular, the distribution of 
the spectrum of the preconditioned system depends only on that of a certain Gaussian matrix, 
not the original linear system. In addition, one can show that 

r (n(AN) < l + °+v)T) > 1 _ 

y l-a- aJt/s) 

where k{AN) is the condition number of the preconditioned system, r is the rank of A, and a is a 
parameter [32] . For example, if we choose the oversampling factor 7 in Algorithm [2] to be 2, then 
the condition number of the new linear system is less than 6 with high probability. In addition, 
a result on bounds on the singular values provided in [32| enable CS to work more efficiently. 

Moreover, while slower in terms of FLOPS than FFT-based fast transforms, the Gaussian 
transform comes with several other advantages for large-scale environments. First, it automati¬ 
cally speeds up with sparse input matrices and fast linear operators (in which case FFT-based fast 
transforms are no longer “fast”). Second, the preconditioning process is embarrassingly parallel 
and thus scales well in parallel environments. Relatedly, it is easy to implement using multi¬ 
threads or MPI. Third, it still works (with an extra “allreduce” operation) when A is partitioned 
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Listing 1: One iteration in LSQR 


u = A.matvec(v) — alpha*u 

beta = sqrt (comm. allr educe (np . dot (u , u))) 

V = comm, allreduce (A. rmatvec (u)) — beta^v 


Listing 2: One iteration in CS 

V = comm, allreduce (A. rmatvec ( r )) — beta*v 
X += alpha*v 
r —= alplia*A. matvec (v) 


Figure 2: Python code snippets for LSQR-based and CS-based iterations, respectively, illustrating 
that the latter has one synchronization point periteration, while the former has two. 


along its bigger dimension. Lastly, when implemented properly, Gaussian random variables can be 
generated very fast [82] (which is nontrivial, given that the dominant cost in naively-implemented 
Gaussian-based projections can be generating the random variables). For example, it takes less 
than 2 seconds to generate 10® random Gaussian numbers using 12 GPU cores [32]. 

To understand why CS is preferable as a choice of iterative solver compared to other methods 
such as the conjugate gradient based LSRN, one has to take the convergence rate and computa¬ 
tion/communication costs into account. In general, if (a bound for) the condition number of the 
linear system is large or not known precisely, then the CS method will fail ungracefully (while 
LSQR will just converge very slowly). However, with the very strong preconditioning guarantee of 
the Gaussian transform, we have very strong control on the condition number of the embedding, 
and thus the CS method can be expected to converge within a very few iterations. In addition, 
since CS doesn’t have vector inner products that require synchronization between nodes (while 
the conjugate gradient based LSQR does), CS has one less synchronization point per iteration, 
he., it has improved communication properties. See Figure [2] for the Python code snippets of 
LSQR and CS, respectively. On each iteration, both methods have to do two matrix-vector mul¬ 
tiplications, while CS only needs one cluster-wide synchronization compared to two in LSQR. 
Thus, the more communication-efficient CS method is enabled by the very strong control on con¬ 
ditioning that is provided by the more expensive Gaussian projection. It is this advantage that 
makes CS favorable in the distributed environments, where communication costs are considered 
more expensive. 

5.2 Solving £i regression in distributed environments 

In this subsection, we describe implementation details for both low-precision and high-precision 
solvers for the regression problem in large-scale distributed environments. These algorithms 
were implemented using MapReduce framework [3] which (at least until the relatively recent 
development of the Apache Spark framework) was the de facto standard parallel environment for 
analyzing massive datasets. 

Low-precision solver Recall that one can use the sampling algorithm described in Section [4.31 
to obtain a low-precision approximate solution for £i regression. This can be summarized in the 
following three steps. 
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1. Compute an £i-well-conditioned basis U = AR ^ for A. 

2. Construct an importance sampling distribution {pi}™ ^ based on the row norms of U. 
Randomly sample a small number of constraints according to {pi}^i to construct a sub¬ 
problem. 

3. Solve the £i-regression problem on the subproblem. 

Next, we will discuss some of the implementation details of the above three steps in the 
MapReduce framework. The key thing to note is that, for the problems we are considering, the 
dominant cost is the cost of input/output, i.e., communicating the data, and hence we want to 
extract as much information as possible for each pass over the data. 

The first step, as described in Section 14.31 is to construct an ii well-conditioned basis for 
A] and for this one can use one of the following three methods—ellipsoid rounding (ER), a QR 
factorization of HA, where 11^ is a low-distortion subspace embedding matrix in terms of 
norm (QR), or a combination of these two (QR-I-ER method). See Tabled for summary of these 
approaches to conditioning. Note that many conditioning methods are embarrassingly parallel, 
in which case it is straightforward to implement them in MapReduce. For example, the Cauchy 
transform (CT) with embedding dimension r can be implemented in the following manner. 

Mapper: 

1: For each row a* of A, generate a vector Ci € consisting r standard Cauchy random 

variables. 

2: For j = 1,.. .r, emit {j, CijOi) where Cij denotes the j-th element of Cj. 

Reducer: 

1: Reduce vectors associated with key k to Vk with addition operation. 

2: Return Vk- 

After collecting all the vectors v^, for k = 1,... ,r, one only has to assemble these vectors and 
perform QR decomposition on the resulting matrix, which completes the preconditioning process. 

With the matrix R~^ such that AR~^ is well-conditioned, a second pass over the dataset 
is sufficient to construct a subproblem and obtain several approximate solutions to the original 
problem, i.e., the second and three steps of the sampling algorithm above. Note that since com¬ 
putation is a less precious resource than communication here, one can exploit this to compute 
multiple subsampled solutions in this single pass. (E.g., performing, say, 100 matrix-vector prod¬ 
ucts is only marginally more expensive than performing 1, and thus one we can solve multiple 
subsampled solutions in a single “pass” with almost no extra effort. To provide an example, on a 
10-node Hadoop cluster, with a matrix of size ca. 10® x 50, a single query took 282 seconds, while 
100 queries took only 383 seconds, meaning that the extra 99 queries come almost “for free.”) 
We summarize the basic steps as follows. Assume that A G condition number ki , s 

is the sampling size and is the number of approximate solutions desired. Then the following 
algorithm returns approximate solutions to the original problem. 

Mapper: 

1: For each row a* of A, let pi = min{s||aj||i/(Kin^/^), 1}. 

2: For k = 1,..., Ux, emit {k, aijpi) with probability pi. 

Reducer: 

1: Collect row vectors associated with key k and assemble A^. 

2: Compute Xfc = argmin^T^-^i ||Afcx||i using interior-point methods. 

3 : Return Xk- 


37 




passes 

extra work per pass 

subgradient [51] 

C)(nVe") 


gradient [3l] 



ellipsoid [HH] 

0{n‘^ log(Ki/e)) 


IPCPM [H] 

C>(n log(Ki/e)) 

logn) 


Table 9: Iterative algorithms for solving l\ regression. 


Note here, in the second step of the reducer above, since the size of the subsampled matrix Ak 
typically only depends on the low dimension n, the subproblem can be fit into the memory of a 
single machine and can be solved locally. 

As an aside, note that such an algorithm can be used to compute approximate solutions for 
other problems such as the quantile regression problem by only increasing the sampling size by 
a constant factor. In the authors evaluate the empirical performance of this algorithm by 
using several different underlying preconditioners, e.^., CT, FCT, etc., on a terabyte-size dataset 
in Hadoop to solve regression and other quantile regression problems. 

High-precision solver To obtain a high-precision solution for the £1 regression problem, we 
have to resort to iterative algorithms. See Table O where we summarize several iterative al¬ 
gorithms in terms of their convergence rates and complexity per iteration. Note that, among 
these methods, although IPCPM (interior point cutting plane methods) needs additional work at 
each iteration, the needed of number of passes is linear in the low dimension n and it only has 
a dependence on log(l/e). Again, since communication is a much more precious resource than 
computation in the distributed application where this was implemented, this can be an acceptable 
tradeoff when, e.g., a medium-precision solution is needed. 

Meng and Mahoney [62] proposed a randomized IPCPM algorithm to solve the ii regression 
problem to medium precision in large-scale distributed environments. It includes several features 
specially-designed for MapReduce and distributed computation. (To describe the method, recall 
that IPCPM is similar to a bisection method, except that it works in a high dimensional space. It 
starts with a search region 5o = {x | S'x < t}, which contains a ball of desired solutions described 
by a separation oracle. At step k, we first compute the maximum-volume ellipsoid £k inscribing 
Sk- Let Uk be the center of and send Uk to the oracle. If Uk is not a desired solution, the oracle 
returns a linear cut that refines the search region Sk —)• Sk+i-) The algorithm of [62] is different 
from the standard IPCPM, mainly for the following two reasons. 

• Initialization nsing all the solutions returned by sampling algorithms. To con¬ 
struct a search region Sq , one can use the multiple solutions returned by calling the sampling 
algorithm, e.g., low-precision solutions, to obtain a much better initial condition. If we de¬ 
note by xi,...X]\f the N approximation solution, then given each x, let / = ||Ax||i and 
g = A^sign(Ax). Note that given xi,... ,xn, computing fi, gi ior i = 1,..., N can be done 
in a single pass. Then we have 

||x* — x\\2 < ||A(x* — x)||i < ||Ax*||i -|- ||Ax||i < 2/. 

Hence, for each subsampled solution Xj, we have a hemisphere that contains the optimal 
solution. We use all these hemispheres to construct a better initial search region 5o, which 
may potentially reduce the number of iterations needed for convergence. 

• Performing multiple queries per iteration. Instead of sending one query point at each 
iteration, one can exploit the fact that it is inexpensive to compute multiple query points 
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per iteration, and one can send multiple query points at a time. Let us still use Xi to denote 
the multiple query points. Notice that by convexity, 

||^a^*||i > M^lli + ff'{x* — x). 

This implies g'^x* < g"^x. That is, given any query point x, the subgradient serves as a 
separation oracle which returns a half-space that contains the desired ball. This means that, 
for each query point Xi, a half-space containing the ball of desired solutions will be returned. 

Note that both of these differences take advantage of performing extra computation while mini¬ 
mizing the number of iterations (which is strongly correlated with communication for MapReduce 
computations). 

5.3 Detailed empirical evaluations of £2 regression solvers in parallel/distributed 
environments 

In this subsection, we provide a detailed empirical evaluation of the performance of RandNLA 
algorithms for solving very over-determined very large-scale £2 regression problems. Recall that 
the subspace embedding that is a crucial part of RandNLA algorithms can be data-aware (i.e., 
a sampling algorithm) or data-oblivious (i.e., a projection algorithm). Recall also that, after 
obtaining a subspace embedding matrix, one can obtain a low-precision solution by solving the 
resulting subproblem, or one can obtain a high-precision solution by invoking a iterative solver, 
e.g., LSQR [l3], for £2 regression, with a preconditioner constructed from by the embedding. 
Thus, in this empirical evaluation, we consider both random sampling and random projection 
algorithms, and we consider solving the problem to low-precision, medium-precision, and high- 
precision on a suite or data sets chosen to be challenging for different classes of algorithms. We 
consider a range of matrices designed to “stress test” all of the variants of the basic meta-algorithm 
of Section [2] that we have been describing, and we consider matrices of size ranging up to just 
over the terabyte size scale. 


5.3.1 Experimental setup 

In order to illustrate a range of uniformity and nonuniformity properties for both the leverage 
scores and the condition number, we considered the following four types of datasets. 

• UG (matrices with uniform leverage scores and good condition number); 

• UB (matrices with uniform leverage scores and bad condition number); 

• NG (matrices with nonuniform leverage scores and good condition number); 

• NB (matrices with nonuniform leverage scores and bad condition number). 

These matrices are generated in the following manner. For matrices with uniform leverage scores, 
we generated the matrices by using the commands that are listed in Table flOl For matrices with 
nonuniform leverage scores, we considered matrices with the following structure: 


A = 




where B G x( 1 ^/ 2 ) jg random matrix with each element sampled from W(0,1), / G 

]^('^/ 2 )x(d/ 2 ) -g i(ientity matrix, and R G ^ is a random matrix generated using 

le -8 * rand(m-d/2,d/2) . In this case, the condition number of A is controlled by a. It is 
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u = 

orth (randn (m. 

n)); 

s = 

diag (linspace 

(1 ,1/kappa,n)) 

V = 

orth(randn(n, 

n)); 

A = 

U*S*V’ ; 


X = 

randn(n,1); 


b = 

A=f:x; 


err 

= randn(m, 1); 



b = b + 0.25*norm(b)/norm( err )* err ; 


Table 10: Commands (presented in MATLAB format) used to generate matrices with uniform 
leverage scores, i.e., the UG and UB matrices. Here, kappa is a parameter used to determine the 
condition number of the generated matrix. 


NAME 

CONDITION NUMBER 

LEVERAGE SCORES 

COHERENCE 

STACKl 

STACK2 

(for NB AND NG only) 

unchanged 

INCREASED 

DIVIDED BY REPNUM 

UNKNOWN 

DIVIDED BY REPNUM 

ALWAYS 1 


Table 11: Summary of methods for stacking matrices, to generate matrices too large to fit into 
RAM; here, REPNUM denotes the number of replications and coherence is defined as the largest 
leverage score of the matrix. 


worth mentioning that the last d/2 rows of the above matrix have leverage scores exactly 1 and 
the rest ones are approximately d/2/{n — d/2). Also, for matrices with bad condition number, the 
condition number is approximately le6 (meaning 10®); while for matrices with good condition 
number, the condition number is approximately 5. 

To generate a large-scale matrix that is beyond the capacity of RAM, and to evaluate the 
quality of the solution for these larger inputs, we used two methods. First, we replicate the matrix 
(and the right hand side vector, when it is needed to solve regression problems) REPNUM times, 
and we “stack” them together vertically. We call this naive way of stacking matrices as STACKl. 
Alternatively, for NB or NG matrices, we can stack them in the following manner: 



/aB 

R\ 

A = 

aB 

R 


\ 0 

IJ 


We call this stacking method STACK2. The two different stacking methods lead to different 
properties for the linear system being solved—we summarize these in Table fTTI— and, while they 
yielded results that were usually similar, as we mention below, the results were different in 
certain extreme cases. With either method of stacking matrices, the optimal solution remains the 
same, so that we can evaluate the approximate solutions of the new large least-squares problems. 
We considered these and other possibilities, but in the results reported below, unless otherwise 
specified we choose the following: for large-scale UG and UB matrices, we use STACKl to generate 
the data; and, for large-scale NG and NB matrices, we use STACK2 to generate the data. 

Recall that Table [5] provides several methods for computing an £2 subspace embedding matrix. 
Since a certain type of random projection either can be used to obtain an embedding directly or 
can be used (with the algorithm of [15]) to approximate the leverage scores for use in sampling. 
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we consider both data-aware and data-oblivious methods. Throughout our evaluation, we use the 
following notations to denote various ways of computing the subspace embedding. 

• PROJ CW — Random projection with the input-sparsity time CW method 

• PROJ GAUSSIAN — Random projection with Gaussian transform 

• PROJ RADEMACHER — Random projection with Rademacher transform 

• PROJ SRDHT — Random projection with Subsampled randomized discrete Hartley trans¬ 
form [85] 

• SAMP APPR — Random sampling based on approximate leverage scores 

• SAMP UNIF — Random sampling with uniform distribution 

Note that, instead of using a vanilla SRHT, we perform our evaluation with a SRDHT (ie., 
a subsampled randomized discrete Hartley transform). (An SRDHT is a related FFT-based 
transform which has similar properties to a SRHT in terms of speed and accuracy but doesn’t have 
the restriction on the dimension to be a power of 2.) Also note that, instead of using a distributed 
FFT-based transform to implement SRDHT, we treat the transform as a dense matrix-matrix 
multiplication, hence we should not expect SRDHT to have computational advantage over other 
transforms. 

Throughout this section, by embedding dimension, we mean the projection size for projection 
based methods and the sampling size for sampling based methods. Also, it is worth mentioning 
that for sampling algorithm with approximate leverage scores, we fix the underlying embedding 
method to be PROJ CW and the projection size c to be (i^/4. In our experiments, we found that— 
when they were approximated sufficiently well—the precise quality of the approximate leverage 
scores do not have a strong influence on the quality of the solution obtained by the sampling 
algorithm. We will elaborate this more in Section 15.3.31 

The computations for Table fT^ Figure 0] and Table [13] below (i.e., for the smaller-sized 
problems) were performed on a shared-memory machine with 12 Intel Xeon CPU cores at clock 
rate 2GHz with 128GB RAM. In these cases, the algorithms are implemented in MATLAB. All 
of the other computations (i.e., for the larger-sized problems) were performed on a cluster with 
16 nodes (1 master and 15 slaves), each of which has 8 CPU cores at clock rate 2.5GHz with 
25GB RAM. For all these cases, the algorithms are implemented in Spark via a Python API. 

5.3.2 Overall performance of low-precision solvers 

Here, we evaluate the performance of the 6 kinds of embedding methods described above (with 
different embedding dimension) on the 4 different types of dataset described above (with size le7 
by 1000). For dense transforms, e.g., PROJ GAUSSIAN, due to the memory capacity, the largest 
embedding dimension we can handle is 5e4. For each dataset and each kind of the embedding, we 
compute the following three quantities: relative error of the objective |/ —/*|//*; relative error of 
the solution certificate ||x — x*||2/||x*||2; and the total running time to compute the approximate 
solution. The results are presented in Figure [3| 

As we can see, when the matrices have uniform leverage scores, all the methods including 
SAMP UNIF behave similarly. As expected, SAMP UNIF runs fastest, followed by PROJ CW. On the 
other hand, when the leverages scores are nonuniform, SAMP UNIF breaks down even with large 
sampling size. Among the projection based methods, the dense transforms, i.e., PROJ GAUSSIAN, 
PROJ RADEMACHER and PROJ SRDHT, behave similarly. Although PROJ CW runs much faster, it 
yields very poor results until the embedding dimension is large enough, i.e., c = 3e5. Meanwhile, 
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sampling algorithm with approximate leverage scores, ie., SAMP APPR, tends to give very reliable 
solutions. (This breaks down if the embedding dimension in the approximate leverage score 
algorithm is chosen to be too small.) In particular, the relative error is much lower throughout all 
choices of the embedding dimension. This can be understood in terms of the theory; see mm 
and [5S] for details. In addition, its running time becomes more favorable when the embedding 
dimension is larger. 

As a more minor point, theoretical results also indicate that the upper bound of the relative 
error of the solution vector depends on the condition number of the system as well as the amount 
of mass of b lies in the range space of A, denote by 7 [TS]. Across the four datasets, 7 is roughly 
the same. This is why we see the relative error of the certificate, i.e., the vector achieving the 
minimum solution, tends to be larger when the condition number of the matrix becomes higher. 

5.3.3 Quality of the approximate leverage scores 

Here, we evaluate the quality of the fast approximate leverage score algorithm of m, and we 
investigate the quality of the approximate leverage scores with several underlying embeddings. 
(The algorithm of [15j considered only Hadamard-based projections, but other projection methods 
could be used, leading to similar approximation quality but different running times.) We consider 
only an NB matrix since leverage scores with nonuniform distributions are harder to approximate. 
In addition, the size of the matrix we considered is only rather small, le6 by 500, due to the need 
to compute the exact leverage scores for comparison. Our implementation follows closely the main 
algorithm of m , except that we consider other random projection matrices. In particular, we used 
the following four ways to compute the underlying embedding: namely, PRO J CW, PRO J GAUSSIAN, 
PROJ RADEMACHER, and PROJ SRDHT. For each kind of embedding and embedding dimension, we 
compute a series of quantities which characterize the statistical properties of the approximate 
leverage scores. The results are summarized in Table [T2j 
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sketch size 


(b) i/-ri/iri 

le7 X 1000 UG matrix 



le7 X 1000 UB matrix 



sketch size 


(h) i/-ri/iri 

le7 X 1000 NG matrix 



le7 X 1000 NB matrix 






Figure 3: Evaluation of all 6 of the algorithms on the 4 different types of matrices of size le7 
by 1000. For each method, the following three quantities are computed: relative error of the 
objective \f — f*\/f*', relative error of the certificate ||x — x*||2/||x*||2; and the running time to 
compute the approximate solution. Each subplot shows one of the above quantities versus the 
embedding dimension, respectively. For each setting, 3 independent trials are performed and the 
median is reported. 
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c 

PROJ CW 

PROJ GAUSSIAN PROJ RADEMACHER 

PROJ SRDH 


\\p-p*h/\\p*\\2 

5e2 

0.9205 

0.7738 

0.7510 

0.5008 

1e3 

0.9082 

0.0617 

0.0447 

0.0716 

5e3 

0.9825 

0.0204 

0.0072 

0.0117 

1e4 

0.9883 

0.0143 

0.0031 

0.0075 

5e4 

0.9962 

0.0061 

0.0006 

0.0030 

1e5 

0.0016 

0.0046 

0.0003 

0.0023 


Dkl{p*\\p) 

5e2 

18.5241 

0.0710 

0.6372 

0.1852 

1e3 

19.7773 

0.0020 

0.0015 

0.0029 

5e3 

20.3450 

0.0002 

0.0001 

0.0001 

1e4 

20.0017 

0.0001 

0.0001 

0.0001 

5e4 

19.2417 

1.9E-5 

I.Oe-5 

I.Oe-5 

1e5 

0.0001 

I.Oe-5 

5e-6 

5e-6 


Ul = 

5e2 

28.6930 

7.0267 

7.3124 

4.0005 

1e3 

11.4425 

1.1596 

1.1468 

1.2201 

5e3 

50.3311 

1.0584 

1.0189 

1.0379 

1e4 

82.6574 

1.0449 

1.0099 

1.0199 

5e4 

218.9658 

1.0192 

1.0018 

1.0094 

1e5 

1.0016 

1.0108 

1.0009 

1.0060 


as = max,{pf/p*’*} 

5e2 

0 

24.4511 

16.8698 

4.5227 

1e3 

0 

1.3923 

1.3718 

1.3006 

5e3 

0 

1.1078 

1.1040 

1.1077 

1e4 

0 

1.0743 

1.0691 

1.0698 

5e4 

0 

1.0332 

1.0317 

1.0310 

1e5 

1.0236 

1.0220 

1.0218 

1.0198 


/3l = 

5e2 

0 

0.0216 

0.0448 

0.4094 

1e3 

0 

0.8473 

0.8827 

0.8906 

5e3 

0 

0.9456 

0.9825 

0.9702 

1e4 

0 

0.9539 

0.9916 

0.9827 

5e4 

0 

0.9851 

0.9982 

0.9922 

1e5 

0.9969 

0.9878 

0.9993 

0.9934 


j3s = mini{pf/p*’*} 

5e2 

0 

0.0077 

0.0141 

0.1884 

1e3 

0 

0.7503 

0.7551 

0.7172 

5e3 

0 

0.9037 

0.9065 

0.9065 

1e4 

0 

0.9328 

0.9306 

0.9356 

5e4 

0 

0.9704 

0.9691 

0.9710 

1e5 

0.9800 

0.9787 

0.9789 

0.9803 


Table 12: Quality of the approximate leverage scores. The test was performed on an NB matrix 
with size le6 by 500. In above, p denotes the distribution by normalizing the approximate leverage 
scores and p* denotes the exact leverage score distribution. Dkl{p\\q) is the KL divergence [SB] 
of q from p dehned as YliPi = {'^\P*i — 1} ^ — {'^\Pi < !}■ this case, p^ denotes 

the corresponding slice of p, and the quantities p^,p*’^ ,p*'^ are defined similarly. 

As we can see, when the projection size is large enough, all the projection-based methods to 
compute approximations to the leverage scores produce highly accurate leverage scores. Among 
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these projection methods, PROJ CW is typically faster but also requires a much larger projection 
size in order to yield reliable approximate leverage scores. The other three random projections 
perform similarly. In general, the algorithms approximate the large leverage scores (those that 
equal or are close to 1) better than the small leverage scores, since ol and Pl are closer to 1. 
This is crucial when calling SAMP APPR since the important rows shall not be missed, and it is a 
sufficient condition for the theory underlying the algorithm of m to apply. 

Next, we invoke the sampling algorithm for the £2 regression problem, with sampling size 
s = le4 by using these approximate leverage scores. We evaluate the relative error on both 
the solution vector and objective and the total running time. For completeness and in order to 
evaluate the quality of the approximate leverage score algorithm, we also include the results by 
using the exact leverage scores. The results are presented in Figured! 






A 


PROJ CW 

1-^ PROJ GAUSSIAN 

PROJ RADEMACHER 
— PROJ SRDHT 

» SAMP EXACT 




10^ 10“ 10^ 

projection size 


(b) i/-ri/iri 



projection size 
(c) Running tinie(sec) 


Figure 4: Performance of sampling algorithms with approximate leverage scores, as computed 
by several different underlying projections. The test was performed on an NB matrix of size le6 
by 500 and the sampling size was le4. Each subplot shows one of the following three quantities 
versus the projection size used in the underlying random projection phase: relative error of the 
objective |/ —/*|//*; relative error of the certificate ||x — x*||2/||a:*||2; and the running time. For 
each setting, 5 independent trials are performed and the median is reported. 


These results suggest that the precise quality of the approximate leverage scores does not sub¬ 
stantially affect the downstream error, i.e., sampling-based algorithms are robust to imperfectly- 
approximated leverage scores, as long as the largest scores are not too poorly approximated. 
(Clearly, however, we could have chosen parameters such that some of the larger scores were very 
poorly approximated, e.g., by choosing the embedding dimension to be too small, in which case 
the quality would matter. In our experience, the quality matters less since these approximate 
leverage scores are sufficient to solve £2 regression problems.) Finally, and importantly, note that 
the solution quality obtained by using approximate leverage scores is as good as that of using 
exact leverage scores, while the running time can be much less. 

5.3.4 Performance of low-precision solvers when n changes 

Here, we explore the scalability of the low-precision solvers by evaluating the performance of all 
the embeddings on NB matrices with varying n. We fix d = 1000 and let n take values from 2.5e5 
to leS. These matrices are generated by stacking an NB matrix with size 2.5e5 by 1000 REPNUM 
times, with REPNUM varying from 1 to 400 using STACKl. For conciseness, we fix the embedding 
dimension of each method to be either 5e3 or 5e4. The relative error on certificate and objective 
and running time are evaluated. The results are presented in Figure O 

Especially worthy mentioning is that when using STACKl, by increasing REPNUM, as we pointed 
out, the coherence of the matrix, i.e., the maximum leverage score, is decreasing, as the size is 
increased. We can clearly see that, when n = 2.5e5, i.e., the coherence is 1, PROJ CW fails. Once 
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the coherence gets smaller, i.e., n gets larger, the projection-based methods behave similarly and 
the relative error remains roughly the same as we increased n. This is because STACKl doesn’t 
alter the condition number and the amount of mass of the right hand side vector that lies in the 
range space of the design matrix and the lower dimension d remains the same. However, SAMP 
APPR tends to yield larger error on approximating the certificate as we increase REPNUM, i.e., the 
coherence gets smaller. Moreover, it breaks down when the embedding dimension is very small. 



(a) |la:-a;*||2/||a;*||2 



(d) \\x-x*h/\\x*h 


10^ 
10 ° 
10 ^ 
10 ^ 



(e) i/-ri/iri 

s = 5e4 



lO'’ 10' 10 

(b) i/-fi/iri 

s = 5e3 




Figure 5: Performance of all the algorithms on NB matrices with varying n from 2.5e5 to leS 
and fixed d = 1000. The matrix is generated using STACKl. For each method, the embedding 
dimension is fixed to be 5e3 or 5e4. The following three quantities are computed: relative error 
of the objective \f — f*\/f*] relative error of the certificate \\x — x*||2/||x*||2; and the running 
time to compute the approximate solution. For each setting, 3 independent trials are performed 
and the median is reported. 


5.3.5 Performance of low-precision solvers when d changes 

Here, we evaluate the performance of the low-precision solvers by evaluating the performance of 
all the embeddings on NB matrices with changing d. We fix n = le7 and let d take values from 
10 to 2000. For each d, the matrix is generated by stacking an NB matrix with size 2.5e5 by d 
40 times using STACKl, so that the coherence of the matrix is 1/40. For conciseness, we fix the 
embedding of each method to be 2e3 or 5e4. The relative error on certificate and objective and 
running time are evaluated. The results are shown in Figure El 

As can be seen, overall, all the projection-based methods behave similarly. As expected, the 
relative error goes up as d gets larger. Meanwhile, SAMP APPR yields lower error as d increases. 
However, it seems to have a stronger dependence on the lower dimension of the matrix, as it 
breaks down when d is 100 for small sampling size, i.e., s = 2e3. 
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Figure 6: Performance of all the algorithms on NB matrices with varying d from 10 to 2000 
and fixed n = le7. The matrix is generated using STACK 1. For each method, the embedding 
dimension is fixed to be 2e3 or 5e4. The following three quantities are computed: relative error 
of the objective \f — f*\/f*] relative error of the certificate ||x — a;*||2/||a:*||2; and the running 
time to compute the approximate solution. For each setting, 3 independent trials are performed 
and the median is reported. 


5.3.6 Performance of high precision solvers 

Here, we evaluate the use of these methods as preconditioners for high-precision iterative solvers. 
Since the embedding can be used to compute a preconditioner for the original linear system, 
one can invoke iterative algorithms such as LSQR [l3] to solve the preconditioned least-squares 
problem. Here, we will use LSQR. We first evaluate the conditioning quality, i.e., k{AR~^), on 
an NB matrix with size le6 by 500 using several different ways for computing the embedding. 
The results are presented in Table [T3l Then we test the performance of LSQR with these precon¬ 
ditioners on an NB matrix with size leS by 1000 and an NG matrix with size le7 by 1000. For 
simplicity, for each method of computing the embedding, we try a small embedding dimension 
where some of the methods fail, and a large embedding dimension where most of the methods 
succeed. See Figure [7] and Figure [8] for details. 

The convergence rate of the LSQR phase depends on the preconditioning quality, i.e., k{AR~^) 
where R is obtained by the QR decomposition of the embedding of A, ^A. See Section 14.21 for 
more details. Table [T3] implies that all the projection-based methods tend to yield preconditioners 
with similar condition numbers once the embedding dimension is large enough. Among them, 
PROJ CW needs a much larger embedding dimension to be reliable (clearly consistent with its 
use in low-precision solvers). In addition, overall, the conditioning quality of the sampling-based 
embedding method, i.e., SAMP APPR tends to be worse than that of projection-based methods. 

As for the downstream performance, from Figure [7] we can clearly see that, when a small 
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embedding dimension is used, i.e., s = 5e3, PROJ GAUSSIAN yields the best preconditioner, as its 
better preconditioning quality translates immediately into fewer iterations for LSQR to converge. 
This is followed by SAMP APPR. This relative order is also suggested by Table [T^ As the embedding 
dimension is increased, i.e., using large embedding dimension, all the method yield significant 
improvements and produce much more accurate solutions compared to that of NOCO (LSQR 
without preconditioning), among which PROJ CW with embedding dimension 3e5 converges to 
a nearly machine-precision solution within only 5 iterations. As for the running time, since 
each iteration of LSQR only involves with two matrix-vector multiplications (costs less than 2 
minutes in our experiments), the overall running time is dominated by the time for computing the 
preconditioner. As expected, PROJ CW runs the fastest and the running time of PROJ GAUSSIAN 
scales linearly in the embedding dimension. In SAMP APPR, the sampling process needs to make 
1-2 passes over the dataset but the running time is relatively stable regardless of the sampling 
size, as reflected in Figure [7l(c)&:(f). Finally, note that the reason that the error does not drop 
monotonically in the solution vector is the following. With the preconditioners, we work on a 
transformed system, and the theory only guarantees monotonicity in the decreasing of the relative 
error of the certificate of the transformed system, not the original one. 

Finally, a minor but potentially important point should be mentioned as a word of caution. As 
expected, when the condition number of the linear system is large, vanilla LSQR does not converge 
at all. On the other hand, when the condition number is very small, from Figure [HI there is no 
need to precondition. If, in this latter case, a randomized preconditioning method is used, then 
the embedding dimension must be chosen to be sufficiently large: unless the embedding dimension 
is large enough such that the conditioning quality is sufficiently good, then preconditioned LSQR 
yields larger errors than even vanilla LSQR. 


c 

PROJ CW 

PROJ GAUSSIAN 

PROJ RADEMACHER 

PROJ SRDHT 

SAMP APPR 

5e2 

1.08e8 

2.17e3 

1.42e3 

1.19e2 

1.21e2 

1e3 

1.1e6 

5.7366 

5.6006 

7.1958 

75.0290 

5e3 

5.5e5 

1.9059 

1.9017 

1.9857 

25.8725 

1e4 

5.1e5 

1.5733 

1.5656 

1.6167 

17.0679 

5e4 

1.8e5 

1.2214 

1.2197 

1.2293 

6.9109 

1e5 

1.1376 

1.1505 

1.1502 

1.1502 

4.7573 


Table 13: Quality of preconditioning on an NB matrix with size le6 by 500 using several kinds 
of embeddings. For each setting, 5 independent trials are performed and the median is reported. 


6 Discussion and conclusion 

Large-scale data analysis and machine learning problems present considerable challenges and 
opportunities to signal processing, electrical engineering, scientific computing, numerical linear 
algebra, and other research areas that have historically been developers of and/or consumers of 
high-quality matrix algorithms. RandNLA is an approach, originally from theoretical computer 
science, that uses randomization as a resource for the development of improved matrix algorithms, 
and it has had several remarkable successes in theory and in practice in small to medium scale 
matrix computations in RAM. The general design strategy of RandNLA algorithms (for problems 
such as £2 regression and low-rank matrix approximation) in RAM is by now well known: con¬ 
struct a sketch (either by performing a random projection or by random sampling according to a 
judiciously-chosen data-dependent importance sampling probability distribution), and then use 
that sketch to approximate the solution to the original problem (either by solving a subproblem 
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small embedding dimension 



(c) Running tinie(sec) 




(d) \\x-x*h/\\x*h (e) i/-ri/iri 

large embedding dimension 



(f) Running time(sec) 


Figure 7: Evaluation of LSQR with randomized preconditioner on an NB matrix with size le8 
by 1000 and condition number le6. Here, several ways for computing the embedding are imple¬ 
mented. In SAMP APPR, the underlying random projection is PROJ CW with projection dimension 
3e5. For completeness, LSQR without preconditioner is evaluated, denoted by NOCO. In above, by 
small embedding dimension, we mean 5e3 for all the methods. By large embedding dimension, 
we mean 3e5 for PROJ CW, le4 for PROJ GAUSSIAN and 5e4 for SAMP APPR. For each method and 
embedding dimension, the following three quantities are computed: relative error of the objective 
I/ —/*!//*; relative error of the certificate ||x —a:*||2 /||x*|| 2; and the running time to compute the 
approximate solution. Each subplot shows one of the above quantities versus number of iteration, 
respectively. For each setting, only one trial is performed. 


on the sketch or using the sketch to construct a preconditioner for the original problem). 

The work reviewed here highlights how, with appropriate modifications, similar design strate¬ 
gies can extend (for £2-based regression problems as well as other problems such as £i-based regres¬ 
sion problems) to much larger-scale parallel and distributed environments that are increasingly 
common. Importantly, though, the improved scalability often comes due to restricted communi¬ 
cations, rather than improvements in FLOPS. (For example, the use of Chebyshev semi-iterative 
method vs. LSQR in LSRN on MPI; and the use of the MIE with multiple queries on MapReduce.) 
In these parallel/distributed settings, we can take advantage of the communication-avoiding na¬ 
ture of RandNLA algorithms to move beyond FLOPS to design matrix algorithms that use more 
computation than the traditional algorithms but that have much better communication profiles, 
and we can do this by mapping the RandNLA algorithms to the underlying architecture in very 
nontrivial ways. (For example, using more computationally-expensive Gaussian projections to 
ensure stronger control on the condition number; and using the MIE with multiple initial queries 
to construct a very good initial search region.) These examples of performing extra computation 
to develop algorithms with improved communication suggests revisiting other methods from nu¬ 
merical linear algebra, optimization, and scientific computing, looking in other novel ways beyond 
FLOPS for better communication properties for many large-scale matrix algorithms. 
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small embedding dimension 
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large embedding dimension 
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Figure 8: Evaluation of LSQR with randomized preconditioner on an NB matrix with size le7 by 
1000 and condition number 5. Here, several ways for computing the embedding are implemented. 
In SAMP APPR, the underlying random projection is PROJ CW with projection dimension 3e5. For 
completeness, LSQR without preconditioner is evaluated, denoted by NOCO. In above, by small 
embedding dimension, we mean 5e3 for all the methods. By large embedding dimension, we 
mean 3e5 for PROJ CW and 5e4 for the rest. For each method and embedding dimension, the 
following three quantities are computed: relative error of the objective \ f — f*\/f*', relative error 
of the certificate ||a: — fc*||2/||a^*||2; and the running time to compute the approximate solution. 
Each subplot shows one of the above quantities versus number of iteration, respectively. Eor each 
setting, 3 independent trials are performed and the median is reported. 
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